# Machine Learning Project: French Grand Debat

### Project Introduction

First, we need to define a research question that fits the data we collected. The aim of this project could be to answer the following questions:

* What are the 5 most important topics on each theme?
* Can we build different profiles of people with their ideas on the 4 themes?
* Could we complete the profile of someone who did not answer the questions on a given theme by comparing them with the profiles built for the second question?
 
#### Feasibility

We are going to import the data, then check whether the same `authorId` identifies the same person across the different themes. If it does not, we cannot link a person's answers from one theme to another, and we won't be able to answer the second question.


In [19]:
# imported libraries
import src.utils as ut  # project helper used to read the data
import numpy as np

#### Data extraction

In [20]:
df_fiscalite = ut.read_data('data/LA_FISCALITE_ET_LES_DEPENSES_PUBLIQUES.json')
df_democratie = ut.read_data('data/DEMOCRATIE_ET_CITOYENNETE.json')
df_ecologie = ut.read_data('data/LA_TRANSITION_ECOLOGIQUE.json')
df_organisation = ut.read_data('data/ORGANISATION_DE_LETAT_ET_DES_SERVICES_PUBLICS.json')

In [21]:
np.random.seed(1)

# (name, dataframe) pairs for the four surveys; also used by the next cell
dfs = np.array([["fiscalite", df_fiscalite], ["democratie", df_democratie], ["ecologie", df_ecologie], ["organisation", df_organisation]])

# pick 5 random rows of the fiscalite survey and look up their author in every survey
for i in np.random.randint(len(df_fiscalite), size=5):
    auth = df_fiscalite.loc[i, 'authorId']
    print("Author ID : " + auth)

    for df in dfs:
        # zip codes recorded for this author in the current survey (empty if no answer)
        code = df[1].loc[df[1]['authorId'] == auth, 'authorZipCode']
        if len(code) > 0:
            code = code.values[0]
            print("* In " + df[0] + " survey, author has zip code : " + str(code))
        else:
            print("* In " + df[0] + " survey, author has not answered...")
    print("\n############################\n")

Author ID : VXNlcjoxMjViYWQ4Yi0xZmM0LTExZTktOTRkMi1mYTE2M2VlYjExZTE=
* In fiscalite survey, author has zip code : 93320
* In democratie survey, author has not answered...
* In ecologie survey, author has not answered...
* In organisation survey, author has not answered...

############################

Author ID : VXNlcjplYWEyMzA2MC0xZGEzLTExZTktOTRkMi1mYTE2M2VlYjExZTE=
* In fiscalite survey, author has zip code : 82290
* In democratie survey, author has zip code : 82290
* In ecologie survey, author has not answered...
* In organisation survey, author has zip code : 82290

############################

Author ID : VXNlcjo5ZjllMTFiZS0xYTQ3LTExZTktOTRkMi1mYTE2M2VlYjExZTE=
* In fiscalite survey, author has zip code : 59700
* In democratie survey, author has not answered...
* In ecologie survey, author has not answered...
* In organisation survey, author has not answered...

############################

Author ID : VXNlcjplZjFhMGViMS0xZTU4LTExZTktOTRkMi1mYTE2M2VlYjExZTE=
* In fiscalite survey, a

**From the output above, `authorId` appears to be a unique identifier shared by all the dataframes: whenever an author shows up in several surveys, the zip code is the same.**
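
Five random authors are suggestive rather than conclusive. A more systematic check would look for any `authorId` that appears with more than one zip code across the surveys. The sketch below is a minimal illustration on toy data: the `surveys` dictionary and the `inconsistent_authors` function are hypothetical names, and only the column names `authorId` and `authorZipCode` come from the notebook. Matching zip codes are a necessary sign of a shared identifier, not a proof of it.

```python
import pandas as pd

# Toy stand-ins for the survey dataframes (hypothetical data, not the real files).
surveys = {
    "fiscalite": pd.DataFrame({"authorId": ["a", "b", "c"],
                               "authorZipCode": ["93320", "82290", "59700"]}),
    "democratie": pd.DataFrame({"authorId": ["b", "d"],
                                "authorZipCode": ["82290", "75001"]}),
    "ecologie": pd.DataFrame({"authorId": ["c", "d"],
                              "authorZipCode": ["59700", "75002"]}),
}

def inconsistent_authors(surveys):
    """Return the authorIds that appear with more than one zip code."""
    # Stack the (authorId, authorZipCode) pairs of every survey.
    pairs = pd.concat(
        [df[["authorId", "authorZipCode"]] for df in surveys.values()],
        ignore_index=True,
    )
    # Count distinct zip codes per author; more than one means a conflict.
    zip_counts = pairs.groupby("authorId")["authorZipCode"].nunique()
    return sorted(zip_counts[zip_counts > 1].index)

conflicts = inconsistent_authors(surveys)
print(conflicts)
```

An empty list on the real data would support the assumption that `authorId` identifies the same person in every survey.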

### How many people answered several themes?

To build typical profiles, we need a large number of people who answered questions on several themes. The following code counts, for each author, how many of the four surveys they answered.

**Be careful: this cell takes a long time to run!**

In [22]:
# allAuthIds is the set of all the authorIds found in the four surveys
allAuthIds = []
for i in range(4):
    allAuthIds.extend(set(dfs[i,1]['authorId'].values))
allAuthIds = set(allAuthIds)

# all_auth_id_array is the sorted array of all the authorIds
all_auth_id_array = np.sort(np.array(list(allAuthIds)))

# auth_answers_count[i,j] is 1 if all_auth_id_array[i] has answered survey dfs[j]
auth_answers_count = np.zeros((len(allAuthIds), 4), dtype=int)
for j in range(4):
    for i in range(len(all_auth_id_array)):
        auth = all_auth_id_array[i]
        line = dfs[j,1].loc[dfs[j,1]['authorId'] == auth]
        if(len(line) > 0):
            auth_answers_count[i,j] = auth_answers_count[i,j] + 1
            
print("auth_answers_count :")
print(auth_answers_count)

# number_of_survey_taken[i] is the number of surveys answered by all_auth_id_array[i]
number_of_survey_taken = np.sum(auth_answers_count, axis=1)
# number_of_participants_by_survey[j] is the number of participants in survey dfs[j]
number_of_participants_by_survey = np.sum(auth_answers_count, axis=0)

print("#######################")
print("number of participants by survey :")
for i in range(4):
    print(dfs[i,0] + " : " + str(number_of_participants_by_survey[i]))

# number_of_participant_to_several_surveys[i] is the number of participants who
# answered exactly i surveys out of the 4 (0 <= i <= 4)
number_of_participant_to_several_surveys = np.bincount(number_of_survey_taken)

print("#######################")
print("number of participants in x surveys :")
for i in range(5):
    print(str(number_of_participant_to_several_surveys[i]) + " people have participated in "
          + str(i) + " different surveys.")


auth_answers_count :
[[0 0 1 0]
 [1 1 1 0]
 [0 0 0 1]
 ...
 [0 1 0 0]
 [1 0 0 0]
 [0 0 1 0]]
#######################
number of participants by survey :
fiscalite : 54609
democratie : 32800
ecologie : 42963
organisation : 35328
#######################
number of participants in x surveys :
0 people have participated in 0 different surveys.
54970 people have participated in 1 different surveys.
17552 people have participated in 2 different surveys.
10038 people have participated in 3 different surveys.
11378 people have participated in 4 different surveys.
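
The cell above is slow because it scans a whole dataframe once per author and per survey. The same counts could probably be obtained much faster by letting pandas build the author-by-survey table in one pass. The sketch below shows the idea on toy data, under the assumption that only the `authorId` column matters here; `toy_dfs`, `stacked` and `answered` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy stand-ins for the four survey dataframes (hypothetical data).
toy_dfs = [
    ("fiscalite", pd.DataFrame({"authorId": ["a", "b", "b", "c"]})),
    ("democratie", pd.DataFrame({"authorId": ["b", "c"]})),
    ("ecologie", pd.DataFrame({"authorId": ["c", "d"]})),
    ("organisation", pd.DataFrame({"authorId": ["c"]})),
]

# One row per (author, survey) pair; duplicates are dropped so that an author
# who submitted several contributions to a survey is counted once.
stacked = pd.concat(
    [df[["authorId"]].drop_duplicates().assign(survey=name) for name, df in toy_dfs],
    ignore_index=True,
)

# answered.loc[author, survey] is 1 if the author answered that survey, else 0.
answered = pd.crosstab(stacked["authorId"], stacked["survey"])

# Number of surveys answered by each author, and participants per survey.
surveys_per_author = answered.sum(axis=1)
participants_by_survey = answered.sum(axis=0)

# How many authors answered exactly 1, 2, 3 or 4 surveys.
participants_by_count = surveys_per_author.value_counts().sort_index()

print(participants_by_survey)
print(participants_by_count)
```

Note that `pd.crosstab` orders the survey columns alphabetically, so the column order differs from `dfs` in the cell above.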


### Summary

Total number of participants by survey:

* fiscalite: 54609
* democratie: 32800
* ecologie: 42963
* organisation: 35328

Number of surveys answered per participant:

* 54970 people participated in only one survey
* 17552 people participated in 2 surveys
* 10038 people participated in 3 surveys
* 11378 people participated in 4 surveys
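
The figures above can be cross-checked with a little arithmetic: the total number of distinct participants, the share who answered at least two themes, and the consistency between the two sets of counts. The sketch below copies the numbers from the output; the variable names are illustrative.

```python
# Counts copied from the notebook output above.
participants_by_survey = {
    "fiscalite": 54609,
    "democratie": 32800,
    "ecologie": 42963,
    "organisation": 35328,
}
participants_by_count = {1: 54970, 2: 17552, 3: 10038, 4: 11378}

# Every distinct author falls in exactly one "answered k surveys" bucket.
total_participants = sum(participants_by_count.values())

# Authors usable for cross-theme profiles: those who answered at least 2 surveys.
multi_theme = sum(n for k, n in participants_by_count.items() if k >= 2)
share_multi = multi_theme / total_participants

# Consistency check: an author who answered k surveys is counted k times
# in the per-survey totals, so both sums must agree.
total_answers = sum(k * n for k, n in participants_by_count.items())
assert total_answers == sum(participants_by_survey.values())

print(total_participants, multi_theme, f"{share_multi:.1%}")
# → 93938 38968 41.5%
```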

These preliminary results suggest that the project is feasible: 38968 of the 93938 participants (about 41%) answered at least two themes, and 11378 of them answered all four. The next step is to find a method to build these typical profiles.