# ASSIGNMENTS 2 & 3
## Natural Language Processing
Authors: 
  - Guillermo RUBIO LÓPEZ.
  - Francisco Javier LEITÓN JIMÉNEZ.
  
### Problem Statement:
#### Demographic prediction

When a user visits one of our websites, we collect keywords extracted from the website's URL. For each user, the frequency of visits per keyword per day is also stored. For example, suppose that a given user has recently visited the following two sites:

html://mypage/abc-news/aaa-bbb.html

html://mypage/news/aaa.html

The keywords “seen” by the user are then stored as follows (a semicolon separates the entries):

abc:1;news:2;aaa:2;bbb:1;mypage:2
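
The counting rule in this example can be sketched in a few lines of Python. This is only an illustration of the storage format, not the actual extraction pipeline: the function names `url_keywords` and `keyword_counts` are hypothetical, and the tokenisation rule (drop the scheme and the file extension, split on non-alphanumeric characters) is an assumption inferred from the example above.

```python
import re
from collections import Counter


def url_keywords(url):
    """Split a URL into keywords (assumed rule: drop scheme and file extension)."""
    # Remove the scheme, e.g. "html://" in the example above.
    rest = url.split("://", 1)[-1]
    # Remove a trailing file extension such as ".html".
    rest = re.sub(r"\.[A-Za-z0-9]+$", "", rest)
    # Every run of letters or digits counts as one keyword.
    return re.findall(r"[A-Za-z0-9]+", rest)


def keyword_counts(urls):
    """Count keywords over all visited URLs and format them as 'word:count;...'."""
    counts = Counter()
    for url in urls:
        counts.update(url_keywords(url))
    return ";".join(f"{word}:{n}" for word, n in counts.items())


visited = [
    "html://mypage/abc-news/aaa-bbb.html",
    "html://mypage/news/aaa.html",
]
encoded = keyword_counts(visited)
print(encoded)
```

The entries come out in order of first appearance, so their order differs from the line above, but the counts per keyword are the same.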

Thanks to external data (or some sources of data bought by our marketing department), we have demographic information (like age, sex, race, ...) on about 5% of our visitors. The Head of Product wanted to predict demographics (age, sex) for the rest of our visitors from the keywords collected. He then spoke to Mr. Google, who advised him to hire a talented Master student from ESCP Europe, in order to transform his idea into reality.

The Head of Product has asked you to build a machine learning model to predict age and sex for each line in our dataset, which was partially extracted from one month's data (the portion of each day's data was concatenated). The dataset consists of two files, train.csv (to help you train your model) and test.csv. Each file has the format: userID, keywords, age, sex (comma is used as a delimiter). Note that there are some missing data in our dataset, and we removed all the “labels” (age, sex) from the test file.

Once your model is built, you have to use the test.csv file to test your model, and send us the results as a csv file containing only three columns: ID, age_pred, sex_pred. For example, your submission file should look like:

ID,age_pred,sex_pred 
1,35,F
2,45,M
...

You must also send us your solution/code via github. 
Please let me know if you have any questions or concerns.



### CAUTION: USING THE FULL DATASETS MIGHT CRASH YOUR PYTHON KERNEL
To avoid crashing the kernel, use the following variables to load only a subset of each dataset:
  - training_selection.
  - test_selection.


### The code of the Assignment begins here:

In [175]:
# Import libraries
import numpy as np
import pandas as pd
import gensim as gm
import tensorflow as tf
import re
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [176]:
# Number of rows to keep from the (very large) training set
training_selection = 100000
# Number of rows to keep from the (very large) test set, cast to int so it can be used as a row label
test_selection = int(training_selection*0.25)

In [177]:
# Load the training dataset
train = pd.read_csv('train.csv')
train = train.loc[1:training_selection]
print("Info")
train.info()
print("Describe")
train.describe()

Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 1 to 100000
Data columns (total 4 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   ID        100000 non-null  int64 
 1   keywords  88345 non-null   object
 2   age       100000 non-null  int64 
 3   sex       100000 non-null  object
dtypes: int64(2), object(2)
memory usage: 3.1+ MB
Describe


Unnamed: 0,ID,age
count,100000.0,100000.0
mean,1244600.0,46.17831
std,1783389.0,13.13868
min,1.0,14.0
25%,179811.2,37.0
50%,463794.5,45.0
75%,1494604.0,56.0
max,10375940.0,98.0


In [178]:
# Load the test dataset
test = pd.read_csv('test.csv')
test = test.loc[1:test_selection]
print("Info")
test.info()
print("Describe")
test.describe()

Info
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 1 to 25000
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   ID        25000 non-null  int64  
 1   keywords  22130 non-null  object 
 2   age       0 non-null      float64
 3   sex       0 non-null      float64
dtypes: float64(2), int64(1), object(1)
memory usage: 781.4+ KB
Describe


Unnamed: 0,ID,age,sex
count,25000.0,0.0,0.0
mean,1560071.0,,
std,898305.1,,
min,104.0,,
25%,777566.2,,
50%,1571746.0,,
75%,2335324.0,,
max,3111776.0,,


In [179]:
# Print Null values for the training dataset
train.isnull().sum(axis = 0)

ID              0
keywords    11655
age             0
sex             0
dtype: int64

In [180]:
# Print Null values for the test dataset
test.isnull().sum(axis = 0)

ID              0
keywords     2870
age         25000
sex         25000
dtype: int64

In [181]:
# Print heads
train.head()

Unnamed: 0,ID,keywords,age,sex
1,361410,forum:3;contrat:1;calcul:3;conges:1;mission:4;...,47,M
2,211450,villa:1;location:2;aquitaine:2;maison:1;vacanc...,61,F
3,1368807,trafic:1;tournante:1;drogue:1;france:1;plaque:...,45,M
4,3502570,trafic:1;septembre:1;greve:1;sncf:1;sortir:1;p...,22,M
5,2027488,darmanin:1;pour:1;ferme:1;jcms:1;conjoncture:1...,55,M


In [182]:
test.head()

Unnamed: 0,ID,keywords,age,sex
1,2684755,programme:1;qui:1;coupee:1;television:1;montag...,,
2,130714,f75875b5:1;signin:1;signout:1;preavis:1;29d4:1...,,
3,338096,qui:1;les:1;embarrasse:1;international:1;democ...,,
4,2417963,lycee:1;photo:1;raspail:1;ledez:1;annabelle:1;...,,
5,189334,affich:1;forum:1;quel:1;choisir:1,,


In [183]:
# Create the final output from the dataset
finalDF = pd.DataFrame()
finalDF['ID'] = test['ID'].values
finalDF.head()

Unnamed: 0,ID
0,2684755
1,130714
2,338096
3,2417963
4,189334


#### Data Cleaning section
In this section we clean the dataset, since several rows have missing values and others contain content that carries no information for our purpose, which is to predict age and sex from the given set of keywords.

Since the empty values can also give us a hint of the age and sex of the person, we will replace the NaN value with a standardized value of ":0".
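
To see what the placeholder turns into downstream, here is a simplified sketch of the expansion step applied later in this section, leaving out stemming and stop-word removal. `expand_keywords` is a hypothetical stand-in for the notebook's `cleantext` function, written only for this illustration.

```python
def expand_keywords(counttext):
    """Turn 'word:count;word:count' into a text with each word repeated count times."""
    text = ""
    for entry in counttext.split(";"):
        parts = entry.split(":")
        # Entries without a ':' separator are ignored.
        if len(parts) > 1:
            text += (parts[0] + " ") * int(parts[1])
    return text


# A regular keyword string: each word is repeated according to its count.
regular = expand_keywords("forum:3;contrat:1")
# The ":0" placeholder used for missing values: an empty word repeated zero times.
missing = expand_keywords(":0")

print(repr(regular))
print(repr(missing))
```

Under this simplification, every row with missing keywords ends up as the same empty string, so all such users share one identical (empty) feature.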

In [184]:
# Drop the ID column from both datasets, and the empty age and sex columns from the test dataset
train.drop(['ID'], axis = 1, inplace = True)
test.drop(['ID'], axis = 1, inplace = True)
test.drop(['age'], axis = 1, inplace = True)
test.drop(['sex'], axis = 1, inplace = True)

In [185]:
train.head()

Unnamed: 0,keywords,age,sex
1,forum:3;contrat:1;calcul:3;conges:1;mission:4;...,47,M
2,villa:1;location:2;aquitaine:2;maison:1;vacanc...,61,F
3,trafic:1;tournante:1;drogue:1;france:1;plaque:...,45,M
4,trafic:1;septembre:1;greve:1;sncf:1;sortir:1;p...,22,M
5,darmanin:1;pour:1;ferme:1;jcms:1;conjoncture:1...,55,M


In [186]:
test.head()

Unnamed: 0,keywords
1,programme:1;qui:1;coupee:1;television:1;montag...
2,f75875b5:1;signin:1;signout:1;preavis:1;29d4:1...
3,qui:1;les:1;embarrasse:1;international:1;democ...
4,lycee:1;photo:1;raspail:1;ledez:1;annabelle:1;...
5,affich:1;forum:1;quel:1;choisir:1


In [187]:
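# Stop words for all languages in the NLTK stopwords corpus, stored as a set for fast lookups
stop_words = set(stopwords.words())
# Note: the Porter stemmer targets English, while most keywords here are French
porter = PorterStemmer()
def cleantext(counttext):
    """Expand 'word:count;...' into a text where each stemmed word is repeated count times."""
    text = ""
    for entry in counttext.split(";"):
        wordocr = entry.split(":")
        # Skip malformed entries (no ':') and stop words
        if len(wordocr) > 1 and wordocr[0] not in stop_words:
            stem = porter.stem(wordocr[0])
            text += (stem + " ") * int(wordocr[1])
    return text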

In [188]:
# Fill missing values and expand the repeated keywords for the test dataset
test['keywords'].fillna(":0",inplace=True)
test=test[test['keywords'].str.contains(":")]
test['keywords']=test['keywords'].map(lambda x: cleantext(x))

In [189]:
# Remove leftover counts, symbols and tokens containing digits, since they are not relevant
test['keywords']=test['keywords'].str.replace(r':[0-9]*', '',regex=True)
test['keywords']=test['keywords'].str.replace(r'[\_+-]', ';',regex=True)
test['keywords']=test['keywords'].str.replace(r'([A-Za-z0-9])*([%=?#|<>^*()+_]+)+([A-Za-z0-9])*', '',regex=True)
test['keywords']=test['keywords'].str.replace(r'([\dA-Za-z]*)(\d)+([\dA-Za-z]*)', '',regex=True)
test['keywords']=test['keywords'].str.replace(r';+', ';',regex=True)
test.head()

Unnamed: 0,keywords
1,programm coupe televis montag fauss angot mala...
2,signin signout preavi essai ruptur news d...
3,embarrass intern democr clinton hillari livr
4,lyce photo raspail ledez annabel brest anni ta...
5,affich forum quel choisir


In [190]:
# Fill missing values and expand the repeated keywords for the training dataset
train['keywords'].fillna(":0",inplace=True)
train=train[train['keywords'].str.contains(":")]
train['keywords']=train['keywords'].map(lambda x: cleantext(x))
train.head()

Unnamed: 0,keywords,age,sex
1,forum forum forum contrat calcul calcul calcul...,47,M
2,villa locat locat aquitain aquitain maison vac...,61,F
3,trafic tournant drogu franc plaqu actualit,45,M
4,trafic septembr greve sncf sortir perturb maga...,22,M
5,darmanin ferm jcm conjonctur guichet p1_169806...,55,M


In [191]:
# Remove leftover counts, symbols and tokens containing digits, since they are not relevant
train['keywords']=train['keywords'].str.replace(r':[0-9]*', '',regex=True)
train['keywords']=train['keywords'].str.replace(r'[\_+-]', ';',regex=True)
train['keywords']=train['keywords'].str.replace(r'([A-Za-z0-9])*([%=?#|<>^*()+_]+)+([A-Za-z0-9])*', '',regex=True)
train['keywords']=train['keywords'].str.replace(r'([\dA-Za-z]*)(\d)+([\dA-Za-z]*)', '',regex=True)
train['keywords']=train['keywords'].str.replace(r';+', ';',regex=True)
train.head()

Unnamed: 0,keywords,age,sex
1,forum forum forum contrat calcul calcul calcul...,47,M
2,villa locat locat aquitain aquitain maison vac...,61,F
3,trafic tournant drogu franc plaqu actualit,45,M
4,trafic septembr greve sncf sortir perturb maga...,22,M
5,darmanin ferm jcm conjonctur guichet ; consomm...,55,M


#### Predicting Gender
Once the dataset is cleaned, we can start predicting the target values.

In [192]:
# Select the Sex column
y_train_sex = train['sex'].values
y_train_sex

array(['M', 'F', 'M', ..., 'M', 'F', 'F'], dtype=object)

In [193]:
# Get the keywords
X_train_keywords = train['keywords'].values
X_train_keywords

array(['forum forum forum contrat calcul calcul calcul cong mission mission mission mission interim interim fin fin fin fin paiement paiement indemnit indemnit indemnit affich affich affich faq faq cdi droit regl regl pay ifm ifm ifm ifm ifm ',
       'villa locat locat aquitain aquitain maison vacanc vacanc girond girond franc franc ',
       'trafic tournant drogu franc plaqu actualit ', ...,
       'alain. profil ', 'emploi lesquel miser job faut ',
       'regl ba enregistrement;de;livre;audio grossistes.shtml appl appl peintur macron macron macron macron macron macron macron macron prix presid cultur comment deput deput deput deput metropol affich affich affich faill desinscrir insupport insupport plaqu reunion troi troi troi saisir saisir saisir pari pari pari articl articl cadeau cadeau cadeau cadeau cadeau cadeau cadeau actu actu direct direct detail detail marie supprim editori editori maeli musiqu musiqu commerc commerc commerc commerc commerc temoign temoign temoign temoign 

In [194]:
# Split the training set into train and test
X_train, X_test, y_train, y_test = train_test_split(X_train_keywords, y_train_sex, test_size=0.25, random_state=10)

In [195]:
X_test

array(['ordinateur detect forum affich ecran ', 'list pilul   gener faq ',
       '', ...,
       'bizutag traitement telecharg programm scleros peni dictionnair symptom bout caus sensat maker definit forum power faq faq window movi  plaqu affich desagr diagnost download insensibilit doigt ',
       'prophas definit definit faq faq mitos ',
       'affich cuiss gauch engourdiss forum '], dtype=object)

In [196]:
y_train

array(['M', 'M', 'M', ..., 'M', 'F', 'M'], dtype=object)

In [197]:
# Encode the labels 0 for F and 1 for M.
encoder = LabelBinarizer()
encoder.fit(y_train)
y_train = encoder.transform(y_train)
y_test = encoder.transform(y_test)

In [198]:
# Labels encoded
y_train

array([[1],
       [1],
       [1],
       ...,
       [1],
       [0],
       [1]])

In [199]:
# Import some libraries from tensorflow
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing import sequence
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout,Bidirectional, Activation, MaxPooling1D
from tensorflow.keras.layers import Embedding, LSTM
from tensorflow.keras.layers import Conv1D, Flatten

In [200]:
# Parameters of the Neural Network
max_len = 50
vocab_size = 15000
embedding_vector_length = 32

In [201]:
# Create the tokenizer and fit it on the training texts
tokenizer = Tokenizer(num_words=vocab_size,split=' ')
tokenizer.fit_on_texts(X_train)

In [202]:
# Encode the words into  Matrix form
X_train = tokenizer.texts_to_matrix(X_train)

In [203]:
X_test = tokenizer.texts_to_matrix(X_test)

In [204]:
# Pad or truncate every row to max_len columns, for both the train and test datasets
X_train = sequence.pad_sequences(X_train, maxlen=max_len)
X_train.shape

(75000, 50)

In [205]:
X_test = sequence.pad_sequences(X_test, maxlen=max_len)
X_test.shape

(25000, 50)

In [206]:
# Check all the words in the tokenizer
tokenizer.word_index.items()

dict_items([('star', 1), ('peopl', 2), ('magazin', 3), ('week', 4), ('chez', 5), ('franc', 6), ('emploi', 7), ('photo', 8), ('meteo', 9), ('actu', 10), ('previs', 11), ('actualit', 12), ('offr', 13), ('recett', 14), ('celebrit', 15), ('affich', 16), ('forum', 17), ('flash', 18), ('detail', 19), ('annonc', 20), ('societ', 21), ('pari', 22), ('macron', 23), ('faq', 24), ('info', 25), ('plu', 26), ('style', 27), ('news', 28), ('politiqu', 29), ('gastronomi', 30), ('beaut', 31), ('scoop', 32), ('interieur', 33), ('mode', 34), ('defil', 35), ('deco', 36), ('televis', 37), ('programm', 38), ('comment', 39), ('auto', 40), ('definit', 41), ('maison', 42), ('intern', 43), ('jean', 44), ('imag', 45), ('ete', 46), ('articl', 47), ('cinema', 48), ('automn', 49), ('dictionnair', 50), ('mireil', 51), ('darc', 52), ('cuisin', 53), ('secret', 54), ('femm', 55), ('nouveau', 56), ('shtml', 57), ('coiffur', 58), ('hiver', 59), ('sortir', 60), ('bricolag', 61), ('even', 62), ('etr', 63), ('saint', 64), ('

In [207]:
# Model creation
model = Sequential()

# Embedding layer: vocabulary of 15000 words, inputs of length 50, 32-dimensional embeddings
model.add(Embedding(vocab_size, output_dim=embedding_vector_length, input_length=max_len, trainable=True))

# Hidden layers

# ANN 1 + BinaryCrossentropy
#model.add(Bidirectional(tf.keras.layers.LSTM(embedding_vector_length)))
#model.add(Dropout(0.2))
#model.add(Dense(embedding_vector_length, activation='relu'))
#model.add(Dense(units=1, activation='sigmoid'))

# ANN 2 + BinaryCrossentropy
#model.add(Flatten())
#model.add(Dense(embedding_vector_length, activation='relu'))
#model.add(Dense(1))

# ANN 3 + sparse_categorical_crossentropy
model.add(LSTM(embedding_vector_length))
model.add(Dense(units=2, activation='softmax'))


model.summary()

# Compile the model
#model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),optimizer='adam',metrics=['accuracy'])
model.compile(loss='sparse_categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=6)

Model: "sequential_13"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 50, 32)            480000    
_________________________________________________________________
lstm_3 (LSTM)                (None, 32)                8320      
_________________________________________________________________
dense_17 (Dense)             (None, 2)                 66        
Total params: 488,386
Trainable params: 488,386
Non-trainable params: 0
_________________________________________________________________
Epoch 1/6
Epoch 2/6
Epoch 3/6
Epoch 4/6
Epoch 5/6
Epoch 6/6


In [208]:
# Predict on the validation split held out from the training set
pred_test = model.predict(X_test)

In [209]:
# Score and evaluations
score = model.evaluate(X_test, y_test,batch_size=32, verbose=1) 
print('Test accuracy:', score[1])


Test accuracy: 0.5575600266456604


In [210]:
# Predictions
pred_test

array([[0.4246543 , 0.57534575],
       [0.4246543 , 0.57534575],
       [0.4246543 , 0.57534575],
       ...,
       [0.4246543 , 0.57534575],
       [0.4246543 , 0.57534575],
       [0.4246543 , 0.57534575]], dtype=float32)

In [211]:
# Real values
y_test

array([[1],
       [0],
       [1],
       ...,
       [0],
       [1],
       [1]])

In [212]:
# Decode labels
predicted_labels = encoder.inverse_transform(pred_test)
results = pd.DataFrame(predicted_labels)
results.info()
results.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       25000 non-null  object
dtypes: object(1)
memory usage: 195.4+ KB


Unnamed: 0,0
count,25000
unique,1
top,M
freq,25000


After testing the model, the next step is to use it to predict from the test dataset file provided by the professor.

In [213]:
# Extract the cleaned keywords from the test dataset
# CAUTION: the kernel might die if you use the whole test dataset...
test.keywords = test.keywords.astype(str)
test_keywords = test['keywords'].values

In [214]:
# Convert the keywords to matrix and make the predictions.
test_keywords_matrix = tokenizer.texts_to_matrix(test_keywords)
test_keywords_matrix = sequence.pad_sequences(test_keywords_matrix, maxlen=max_len)
predictions = model.predict(test_keywords_matrix)

In [215]:
# Decode labels
predicted_gender = encoder.inverse_transform(predictions)
pred_gender_DF = pd.DataFrame()
pred_gender_DF['gender_pred'] = predicted_gender
pred_gender_DF.info()
pred_gender_DF.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 1 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   gender_pred  25000 non-null  object
dtypes: object(1)
memory usage: 195.4+ KB


Unnamed: 0,gender_pred
count,25000
unique,1
top,M
freq,25000


As the predictions show, the neural network did not give good results: it assigns the same class (M) to every sample, so the accuracy of about 55% simply reflects the share of M labels in the validation split. These results might improve with a larger dataset instead of the sample we used.
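
One possible contributor to the constant predictions, offered here as a hypothesis rather than a verified diagnosis: `texts_to_matrix` returns one row per text with `num_words` columns (an indicator per word index), not a sequence of word indices. `pad_sequences` with `maxlen=50` and its default `truncating='pre'` then keeps only the last 50 columns, which correspond to the rarest words in the vocabulary. The pure-Python sketch below mimics that behaviour with a tiny vocabulary; `to_binary_row`, `to_sequence` and `keep_last` are simplified stand-ins written for this illustration, not the Keras functions themselves.

```python
def to_binary_row(words, word_index, num_words):
    """Bag-of-words row: 1 at column i if the word with index i occurs in the text."""
    row = [0] * num_words
    for word in words:
        idx = word_index.get(word)
        if idx is not None and idx < num_words:
            row[idx] = 1
    return row


def to_sequence(words, word_index, num_words):
    """Sequence of word indices, in reading order."""
    return [word_index[w] for w in words
            if w in word_index and word_index[w] < num_words]


def keep_last(values, maxlen):
    """Left-pad with zeros or drop leading values so that exactly maxlen remain."""
    values = values[-maxlen:]
    return [0] * (maxlen - len(values)) + values


# Toy vocabulary: index 1 is the most frequent word, higher indices are rarer.
word_index = {"forum": 1, "affich": 2, "faq": 3, "photo": 4, "villa": 5, "mitos": 9}
num_words, maxlen = 10, 3

text_a = "forum affich faq".split()
text_b = "photo villa".split()

# Bag-of-words rows cut to the last 3 columns: both texts become all zeros.
matrix_a = keep_last(to_binary_row(text_a, word_index, num_words), maxlen)
matrix_b = keep_last(to_binary_row(text_b, word_index, num_words), maxlen)

# Index sequences padded to length 3: the two texts stay distinguishable.
seq_a = keep_last(to_sequence(text_a, word_index, num_words), maxlen)
seq_b = keep_last(to_sequence(text_b, word_index, num_words), maxlen)

print(matrix_a, matrix_b)
print(seq_a, seq_b)
```

In Keras, the index-sequence variant corresponds to `tokenizer.texts_to_sequences`, whose output is the kind of input an `Embedding` layer expects. Whether switching to it improves the accuracy on this dataset has not been tested here.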

#### Predicting Age
After trying to predict the Gender from the keywords using an Artificial Neural Network, we can proceed with the Age prediction...

In [216]:
# Get the keywords
X_train_keywords = train['keywords'].values
X_train_keywords

array(['forum forum forum contrat calcul calcul calcul cong mission mission mission mission interim interim fin fin fin fin paiement paiement indemnit indemnit indemnit affich affich affich faq faq cdi droit regl regl pay ifm ifm ifm ifm ifm ',
       'villa locat locat aquitain aquitain maison vacanc vacanc girond girond franc franc ',
       'trafic tournant drogu franc plaqu actualit ', ...,
       'alain. profil ', 'emploi lesquel miser job faut ',
       'regl ba enregistrement;de;livre;audio grossistes.shtml appl appl peintur macron macron macron macron macron macron macron macron prix presid cultur comment deput deput deput deput metropol affich affich affich faill desinscrir insupport insupport plaqu reunion troi troi troi saisir saisir saisir pari pari pari articl articl cadeau cadeau cadeau cadeau cadeau cadeau cadeau actu actu direct direct detail detail marie supprim editori editori maeli musiqu musiqu commerc commerc commerc commerc commerc temoign temoign temoign temoign 

In [217]:
# Select the age column
y_train_age = train['age'].values
y_train_age

array([47, 61, 45, ..., 29, 21, 34])

In [218]:
#We convert text into TFIDF feature values
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
tfidfconverter = TfidfVectorizer(max_features=1500, min_df=5, max_df=0.7, stop_words=stopwords.words('english') + stopwords.words('french'))
X = tfidfconverter.fit_transform(X_train_keywords).toarray()

In [219]:
#we split our data into training and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y_train_age, test_size=0.2, random_state=0)

In [220]:
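# Pad or truncate every row to max_len columns, for both the train and test datasets
# NOTE: pad_sequences casts to int32 by default, so TF-IDF weights below 1 become 0,
# and only the last max_len of the 1500 TF-IDF columns are kept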
X_train = sequence.pad_sequences(X_train, maxlen=max_len)
X_train.shape


(80000, 50)

In [221]:

X_test = sequence.pad_sequences(X_test, maxlen=max_len)
X_test.shape

(20000, 50)

We are going to try several models to see which performs best at predicting age, starting with a neural network.

In [225]:
# Parameters of the Neural Network
max_len = 50
vocab_size = 15000
embedding_vector_length = 32


# Model creation
model_age = Sequential()

# Embedding layer: vocabulary of 15000 words, inputs of length 50, 32-dimensional embeddings
model_age.add(Embedding(vocab_size, output_dim=embedding_vector_length, input_length=max_len, trainable=True))

# Hidden layers

# ANN 1 + BinaryCrossentropy
#model.add(Bidirectional(tf.keras.layers.LSTM(embedding_vector_length)))
#model.add(Dropout(0.2))
#model.add(Dense(embedding_vector_length, activation='relu'))
#model.add(Dense(units=1, activation='sigmoid'))

# ANN 2: dense regression head (trained with mean absolute error, see compile below)
model_age.add(Flatten())
model_age.add(Dense(embedding_vector_length, activation='relu'))
model_age.add(Dense(16, activation='relu'))
model_age.add(Dense(8, activation='relu'))
model_age.add(Dropout(0.2))
model_age.add(Dense(1))

# ANN 3 + sparse_categorical_crossentropy 
#model_age.add(LSTM(embedding_vector_length))
#model_age.add(Dense(units=100, activation='softmax'))




model_age.summary()

# Compile the model
#model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),optimizer='adam',metrics=['accuracy'])
model_age.compile(loss='mean_absolute_error',optimizer='adam',metrics=['accuracy'])
history = model_age.fit(X_train, y_train, epochs=10)

Model: "sequential_15"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 50, 32)            480000    
_________________________________________________________________
flatten_5 (Flatten)          (None, 1600)              0         
_________________________________________________________________
dense_22 (Dense)             (None, 32)                51232     
_________________________________________________________________
dense_23 (Dense)             (None, 16)                528       
_________________________________________________________________
dense_24 (Dense)             (None, 8)                 136       
_________________________________________________________________
dropout_2 (Dropout)          (None, 8)                 0         
_________________________________________________________________
dense_25 (Dense)             (None, 1)               

In [226]:
# Predict on the validation split held out from the training set
pred_test_age = model_age.predict(X_test)

In [227]:
# Score and evaluations
score = model_age.evaluate(X_test, y_test,batch_size=32, verbose=1) 
print('Test accuracy:', score[1])


Test accuracy: 0.0


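The neural network obtains 0% accuracy. Accuracy only counts exact matches, so it is a harsh measure for a real-valued age prediction; still, this result gives no evidence that the network learned anything, so we try other models: random forest and gradient boosting.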
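
Because accuracy counts exact matches between prediction and label, a regression output such as age will almost always score close to zero, whatever the quality of the model. A measure such as the mean absolute error (the loss this network was trained with) is easier to interpret. The sketch below compares both measures on made-up ages; the helper names and the numbers are illustrative assumptions, not results from this notebook.

```python
def exact_match_accuracy(y_true, y_pred):
    """Share of predictions that equal the label exactly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between label and prediction, in years."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up ages and real-valued predictions, for illustration only.
ages = [47, 61, 45, 22, 55]
predicted = [44.5, 58.0, 47.5, 30.0, 52.0]

acc = exact_match_accuracy(ages, predicted)
mae = mean_absolute_error(ages, predicted)

print("accuracy:", acc)
print("mean absolute error:", mae)
```

Here no prediction matches a label exactly, so the accuracy is zero even though the predictions are only a few years off on average.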

In [228]:
# Now try a random forest regressor
from sklearn.ensemble import RandomForestRegressor

regressor = RandomForestRegressor(n_estimators=20, random_state=0)
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

In [229]:
y_pred

array([46.19957137, 46.19957137, 46.19957137, ..., 46.19957137,
       46.19957137, 46.19957137])

In [230]:
y_train

array([45, 29, 56, ..., 55, 39, 52])

In [231]:
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)

0.0008582079578426383
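
An R² this close to zero is what a model that always predicts roughly the mean of the targets would obtain, which is consistent with the constant value of about 46.2 in `y_pred`. A small sketch with made-up numbers illustrates this; `r2` is a hypothetical helper implementing the usual definition 1 - SS_res / SS_tot.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


ages = [20, 30, 40, 50, 60]          # made-up targets with mean 40
constant = [40] * len(ages)          # always predict the mean
informative = [22, 29, 41, 48, 60]   # predictions that track the targets

r2_constant = r2(ages, constant)
r2_informative = r2(ages, informative)

print(r2_constant, r2_informative)
```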

In [232]:
# Also try a random forest classifier
from sklearn.ensemble import RandomForestClassifier
classif = RandomForestClassifier(n_estimators = 1000, random_state = 0)
classif.fit(X_train, y_train)


RandomForestClassifier(n_estimators=1000, random_state=0)

In [233]:
y_pred_classif = classif.predict(X_test)

In [234]:
y_pred_classif

array([42, 42, 42, ..., 42, 42, 42])

In [241]:
#gradient boosting classifier
from sklearn.ensemble import GradientBoostingClassifier

lr_list = [0.05, 0.1, 0.5, 1] #lr_list = [0.05, 0.075, 0.1, 0.25, 0.5, 0.75, 1]

for learning_rate in lr_list:
    gb_clf = GradientBoostingClassifier(n_estimators=20, learning_rate=learning_rate, max_features=2, max_depth=2, random_state=0)
    gb_clf.fit(X_train, y_train)

    print("Learning rate: ", learning_rate)
    print("Accuracy score (training): {0:.3f}".format(gb_clf.score(X_train, y_train)))

Learning rate:  0.05
Accuracy score (training): 0.031
Learning rate:  0.1
Accuracy score (training): 0.031
Learning rate:  0.5
Accuracy score (training): 0.031
Learning rate:  1
Accuracy score (training): 0.031


The overall conclusion from these models is that they do not predict age well from the keywords.
The accuracies are very low (about 3% for gradient boosting, measured on the training set), and the random forest models return an almost constant prediction.
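
To put an accuracy of about 3% in context, it can be compared with the simplest possible baseline: always predicting the most frequent age in the training labels. The sketch below uses made-up labels, and `majority_baseline_accuracy` is a hypothetical helper written for this illustration.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy obtained by always predicting the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in eval_labels if label == most_common_label)
    return most_common_label, hits / len(eval_labels)


# Made-up age labels, for illustration only.
train_ages = [42, 42, 42, 35, 35, 61, 29, 55]
eval_ages = [42, 35, 42, 61]

label, baseline_acc = majority_baseline_accuracy(train_ages, eval_ages)
print(label, baseline_acc)
```

If a classifier scores no higher than this baseline on the real labels, which the near-constant predictions above suggest may be the case, it has learned little from the keywords.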

We now predict the age for the test dataset, using the RandomForestClassifier model.

In [242]:
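# NOTE: test_keywords_matrix was built with the gender tokenizer, not with the TF-IDF
# features the classifier was trained on. To reuse the fitted vectorizer instead,
# call transform (not fit_transform) on the test keywords:
#X_age_test = tfidfconverter.transform(test_keywords).toarray()
#test_keywords_matrix = sequence.pad_sequences(X_age_test, maxlen=max_len)
predictions_age = classif.predict(test_keywords_matrix)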

In [243]:
predictions_age

array([42, 42, 42, ..., 42, 42, 42])

In [244]:
pred_age_DF = pd.DataFrame()
pred_age_DF['age_pred'] = predictions_age
pred_age_DF.info()
pred_age_DF.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 1 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   age_pred  25000 non-null  int64
dtypes: int64(1)
memory usage: 195.4 KB


Unnamed: 0,age_pred
count,25000.0
mean,41.99944
std,0.155437
min,33.0
25%,42.0
50%,42.0
75%,42.0
max,60.0


#### Printing to CSV
This section prints the dataframe to a CSV file.

In [245]:
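# Combine the IDs with both predictions, using the column names requested in the assignment
finalDF = pd.concat([finalDF, pred_age_DF, pred_gender_DF.rename(columns={'gender_pred': 'sex_pred'})], axis=1)
finalDF.to_csv('predictions.csv', index=False)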
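
The assignment asks for a file with exactly three columns, `ID`, `age_pred` and `sex_pred`, and no extra index column. A quick way to check a file against that layout is sketched below with the standard `csv` module. `check_submission` is a hypothetical helper, and the allowed sex values `F` and `M` are an assumption based on the training labels.

```python
import csv
import io

EXPECTED_HEADER = ["ID", "age_pred", "sex_pred"]


def check_submission(handle):
    """Return a list of problems found in a submission file (empty if none)."""
    reader = csv.reader(handle)
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        return [f"unexpected header: {header}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):
        if len(row) != 3:
            problems.append(f"line {line_no}: expected 3 fields, got {len(row)}")
        elif row[2] not in ("F", "M"):
            problems.append(f"line {line_no}: unexpected sex value {row[2]!r}")
    return problems


# In-memory examples standing in for open("predictions.csv", newline="").
good = io.StringIO("ID,age_pred,sex_pred\n1,35,F\n2,45,M\n")
bad = io.StringIO(",gender_pred,age_pred\n0,M,42\n")

good_problems = check_submission(good)
bad_problems = check_submission(bad)

print(good_problems)
print(bad_problems)
```

The second example shows the kind of header produced when the DataFrame index is written to the file and the columns are not renamed.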

### Below are two other ways to convert the keywords into integers that we also considered during the analysis

In [18]:
#Using CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform(X_train_keywords)
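
As a rough picture of what count vectorization produces, the pure-Python sketch below builds a vocabulary and one row of word counts per text. `count_vectorize` is a simplified stand-in written for this illustration; it ignores details of scikit-learn's `CountVectorizer` such as lowercasing, the minimum token length and the sparse output format.

```python
def count_vectorize(texts):
    """Build a sorted vocabulary and one row of word counts per text."""
    vocabulary = sorted({word for text in texts for word in text.split()})
    column = {word: i for i, word in enumerate(vocabulary)}
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for word in text.split():
            row[column[word]] += 1
        rows.append(row)
    return vocabulary, rows


# Made-up cleaned keyword texts, in the style of the training data.
texts = ["forum forum contrat", "villa locat locat", "forum villa"]
vocabulary, rows = count_vectorize(texts)

print(vocabulary)
for row in rows:
    print(row)
```

Each column corresponds to one word of the vocabulary, so repeated keywords (which encode visit frequency) are preserved as counts.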

In [None]:
# Using LabelEncoder
from sklearn.preprocessing import LabelEncoder
# Integer encode: each distinct keyword string (not each word) gets one integer
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(X_train_keywords)
print(integer_encoded)