# Machine Learning Project: French Grand Debat

### Project Introduction

First, we need to define a research question that the data we collected can actually support. The aim of this project could be to answer the following questions:

* What are the 5 most important topics on each theme?
* Can we build distinct profiles of people from their ideas on the 4 themes?
* Could we complete the profile of someone who did not answer the questions on a given theme by comparing them with the profiles built for the second question?
 
#### Feasibility

We are going to import the data, then check if the ids of people who submitted ideas on different themes are the same. If not, we won't be able to answer the second question.
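
The id check described above can be sketched with plain set operations. This is only an illustration on toy dataframes: the ids and zip codes are made up, and the `authorId` column name is taken from the cells further down in this notebook.

```python
import pandas as pd

# Toy stand-ins for two theme dataframes (hypothetical values, not real data).
df_a = pd.DataFrame({"authorId": ["u1", "u2", "u3"],
                     "authorZipCode": ["75001", "69002", "13003"]})
df_b = pd.DataFrame({"authorId": ["u2", "u3", "u4"],
                     "authorZipCode": ["69002", "13003", "31000"]})

ids_a = set(df_a["authorId"])
ids_b = set(df_b["authorId"])

shared = ids_a & ids_b   # authors who appear in both themes
only_a = ids_a - ids_b   # authors who appear in the first theme only

print(f"{len(shared)} shared ids, {len(only_a)} ids only in the first theme")
```

If the intersection were empty for every pair of themes, the ids would not be comparable across themes and the second question could not be answered.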


In [1]:
# imported libraries
import src.utils as ut  # project helper used to read the data
import numpy as np

#### Data extraction

In [2]:
df_fiscalite = ut.read_data('data/LA_FISCALITE_ET_LES_DEPENSES_PUBLIQUES.json')
df_democratie = ut.read_data('data/DEMOCRATIE_ET_CITOYENNETE.json')
df_ecologie = ut.read_data('data/LA_TRANSITION_ECOLOGIQUE.json')
df_organisation = ut.read_data('data/ORGANISATION_DE_LETAT_ET_DES_SERVICES_PUBLICS.json')

In [3]:
np.random.seed(1)
# (theme name, dataframe) pairs, built once; later cells index this table as dfs[i, 1]
dfs = np.array([["fiscalite", df_fiscalite], ["democratie", df_democratie], ["ecologie", df_ecologie], ["organisation", df_organisation]])

# Pick 5 random rows of the fiscalite survey and look up each author in every theme
for i in np.random.randint(len(df_fiscalite), size=5):
    auth = df_fiscalite.loc[i, 'authorId']
    print("Author ID : " + auth)

    for df in dfs:
        # Zip codes recorded for this author in the current theme (empty if no answer)
        code = df[1].loc[df[1]['authorId'] == auth, 'authorZipCode']
        if len(code) > 0:
            code = code.values[0]
            print("* In " + df[0] + " survey, author has zip code : " + str(code))
        else:
            print("* In " + df[0] + " survey, author has not answered...")
    print("\n############################\n")

Author ID : VXNlcjoxMjViYWQ4Yi0xZmM0LTExZTktOTRkMi1mYTE2M2VlYjExZTE=
* In fiscalite survey, author has zip code : 93320
* In democratie survey, author has not answered...
* In ecologie survey, author has not answered...
* In organisation survey, author has not answered...

############################

Author ID : VXNlcjplYWEyMzA2MC0xZGEzLTExZTktOTRkMi1mYTE2M2VlYjExZTE=
* In fiscalite survey, author has zip code : 82290
* In democratie survey, author has zip code : 82290
* In ecologie survey, author has not answered...
* In organisation survey, author has zip code : 82290

############################

Author ID : VXNlcjo5ZjllMTFiZS0xYTQ3LTExZTktOTRkMi1mYTE2M2VlYjExZTE=
* In fiscalite survey, author has zip code : 59700
* In democratie survey, author has not answered...
* In ecologie survey, author has not answered...
* In organisation survey, author has not answered...

############################

Author ID : VXNlcjplZjFhMGViMS0xZTU4LTExZTktOTRkMi1mYTE2M2VlYjExZTE=
* In fiscalite survey, a

**From the output above, `authorId` is likely to be a unique identifier shared across the dataframes: an author who answered several themes has the same zip code in each of them.**
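
Five random authors are a small sample, so the same check could be run over every author two themes have in common. The sketch below is one possible way to do it, assuming each dataframe has the `authorId` and `authorZipCode` columns used above. The helper name `zip_code_agreement` and the toy values are hypothetical.

```python
import pandas as pd

def zip_code_agreement(left: pd.DataFrame, right: pd.DataFrame) -> float:
    """Share of authors present in both frames whose zip code is identical."""
    # Keep one row per author (the first zip code seen for each id).
    l = left.drop_duplicates("authorId")[["authorId", "authorZipCode"]]
    r = right.drop_duplicates("authorId")[["authorId", "authorZipCode"]]
    # Inner join: only authors who contributed to both themes remain.
    both = l.merge(r, on="authorId", suffixes=("_left", "_right"))
    if both.empty:
        return float("nan")
    same = both["authorZipCode_left"] == both["authorZipCode_right"]
    return float(same.mean())

# Toy data (made up): u2 agrees across themes, u3 does not.
toy_left = pd.DataFrame({"authorId": ["u1", "u2", "u2", "u3"],
                         "authorZipCode": ["75001", "69002", "69002", "13003"]})
toy_right = pd.DataFrame({"authorId": ["u2", "u3", "u4"],
                          "authorZipCode": ["69002", "13004", "31000"]})

agreement = zip_code_agreement(toy_left, toy_right)
print(agreement)
```

A value close to 1 on the real data would support treating `authorId` as a stable identifier. A noticeably lower value would not disprove it, since people can also move or mistype their zip code.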

### How many people answered all 4 themes?

In [16]:
# Collect the unique author ids of each theme and print the running total
# (an author who answered several themes is counted once per theme here)
allAuthIds = []
for i in range(4):
    allAuthIds.extend(set(dfs[i,1]['authorId'].values))
    print(len(allAuthIds))

# Deduplicate across themes and sort the ids
allAuthIds = set(allAuthIds)
all_auth_id_array = np.sort(np.array(list(allAuthIds)))
print(len(all_auth_id_array))
print(all_auth_id_array)
print(dfs[1,1])  # preview of the democratie dataframe

# auth_answers_count[i, j] is 1 if author i contributed to theme j, else 0
auth_answers_count = np.zeros((len(allAuthIds), 4))
for j in range(4):
    for i in range(len(all_auth_id_array)):
        auth = all_auth_id_array[i]
        # Full scan of the authorId column for every (author, theme) pair
        line = dfs[j,1].loc[dfs[j,1]['authorId'] == auth]
        if len(line) > 0:
            auth_answers_count[i,j] = 1

print(auth_answers_count)

54609
87409
130372
165700
93938
['VXNlcjo0M2E0MTFiYy0yMWFlLTExZTktOTRkMi1mYTE2M2VlYjExZTE='
 'VXNlcjo0M2E0MTcyZS0yNzliLTExZTktOTRkMi1mYTE2M2VlYjExZTE='
 'VXNlcjo0M2E0MWQ2ZS0yOTI0LTExZTktOTRkMi1mYTE2M2VlYjExZTE=' ...
 'VXNlcjpmZmZmOThhOS0xYTI0LTExZTktOTRkMi1mYTE2M2VlYjExZTE='
 'VXNlcjpmZmZmZGYxZC0xZmRlLTExZTktOTRkMi1mYTE2M2VlYjExZTE='
 'VXNlcjpmZmZmZTg1NC0xZWVmLTExZTktOTRkMi1mYTE2M2VlYjExZTE=']
                                                authorId  \
0      VXNlcjo4Mjc4NzQxYS0xZTFkLTExZTktOTRkMi1mYTE2M2...   
1      VXNlcjo4OWQ3MzE5My0xZDYwLTExZTktOTRkMi1mYTE2M2...   
2      VXNlcjowMzYyMTUyNy0xZDEyLTExZTktOTRkMi1mYTE2M2...   
3      VXNlcjo4YWJlYzBmOS0xZGE3LTExZTktOTRkMi1mYTE2M2...   
4      VXNlcjo2Nzc5MjE4OC0xZTIxLTExZTktOTRkMi1mYTE2M2...   
5      VXNlcjpmYmNjODEwNS0xZDkyLTExZTktOTRkMi1mYTE2M2...   
6      VXNlcjo5NmNhYWM4ZS0xZTIwLTExZTktOTRkMi1mYTE2M2...   
7      VXNlcjo0ZTQxY2UxZi0xZTIxLTExZTktOTRkMi1mYTE2M2...   
8      VXNlcjo4MjgzZTUyNi0xZDU5LTExZTktOTRkMi1mYTE2M2...   
9  

KeyboardInterrupt: 
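
The recorded run above ends with a `KeyboardInterrupt`, presumably because the nested loop scans a whole `authorId` column once for each of the 93938 unique authors and each theme. A minimal sketch of a faster alternative follows, assuming only that every theme dataframe has an `authorId` column. The function name `theme_participation` and the toy ids are hypothetical, and no result for the real data is claimed here.

```python
import pandas as pd

def theme_participation(frames: dict) -> pd.DataFrame:
    """Boolean table with one row per author and one column per theme."""
    # Unique author ids over all themes, sorted for a stable row order.
    all_ids = pd.Index(pd.concat([f["authorId"] for f in frames.values()]).unique())
    table = pd.DataFrame(index=all_ids.sort_values())
    for name, frame in frames.items():
        # Index.isin does a hash-based membership test instead of one scan per author.
        table[name] = table.index.isin(frame["authorId"])
    return table

# Toy data (made up): only u2 appears in all four themes.
toy = {
    "fiscalite": pd.DataFrame({"authorId": ["u1", "u2", "u3", "u2"]}),
    "democratie": pd.DataFrame({"authorId": ["u2", "u3"]}),
    "ecologie": pd.DataFrame({"authorId": ["u2", "u4"]}),
    "organisation": pd.DataFrame({"authorId": ["u2"]}),
}

table = theme_participation(toy)
answered_all = int(table.all(axis=1).sum())  # authors present in every theme
print(table)
print("Authors who answered all themes:", answered_all)
```

On the real data, the same call would take a dictionary mapping the four theme names to `df_fiscalite`, `df_democratie`, `df_ecologie` and `df_organisation`. Summing each row of `table` would give the number of themes answered per author.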

In [10]:
print(all_auth_id_array)
print(np.sort(all_auth_id_array))

['VXNlcjo0MjZlMTljOC0yMGE3LTExZTktOTRkMi1mYTE2M2VlYjExZTE='
 'VXNlcjpiM2Y3ODdmMS0yMjE3LTExZTktOTRkMi1mYTE2M2VlYjExZTE='
 'VXNlcjo2ODNlOWYzOC0yODQzLTExZTktOTRkMi1mYTE2M2VlYjExZTE=' ...
 'VXNlcjo0OWQxNzA2YS0yMDlhLTExZTktOTRkMi1mYTE2M2VlYjExZTE='
 'VXNlcjpmMzQ1OWFmZi0xZDc0LTExZTktOTRkMi1mYTE2M2VlYjExZTE='
 'VXNlcjpiNzgxZTNjYy0xZjNkLTExZTktOTRkMi1mYTE2M2VlYjExZTE=']
['VXNlcjo0M2E0MTcyZS0yNzliLTExZTktOTRkMi1mYTE2M2VlYjExZTE='
 'VXNlcjo0M2E0ZWNkNy0xZDg1LTExZTktOTRkMi1mYTE2M2VlYjExZTE='
 'VXNlcjo0M2E0ZWUzNC0yOTNmLTExZTktOTRkMi1mYTE2M2VlYjExZTE=' ...
 'VXNlcjpmZmZkYzZmOC0yMDhlLTExZTktOTRkMi1mYTE2M2VlYjExZTE='
 'VXNlcjpmZmZkZDdkYy0yOTZlLTExZTktOTRkMi1mYTE2M2VlYjExZTE='
 'VXNlcjpmZmZmZGYxZC0xZmRlLTExZTktOTRkMi1mYTE2M2VlYjExZTE=']
