# Regression analysis
In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome' or 'response' variable) and one or more independent variables (often called 'predictors', 'covariates', 'explanatory variables' or 'features'). The most common form of regression analysis is linear regression, in which one finds the line (or a more complex linear combination) that most closely fits the data according to a specific mathematical criterion. For example, the method of ordinary least squares computes the unique line (or hyperplane) that minimizes the sum of squared differences between the true data and that line (or hyperplane). For specific mathematical reasons (see linear regression), this allows the researcher to estimate the conditional expectation (or population average value) of the dependent variable when the independent variables take on a given set of values. Less common forms of regression use slightly different procedures to estimate alternative location parameters (e.g., quantile regression or Necessary Condition Analysis) or estimate the conditional expectation across a broader collection of non-linear models (e.g., nonparametric regression).
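
As a minimal sketch of the ordinary least squares criterion described above, the following fits a line to a few synthetic points with NumPy. The data and the names `x`, `y`, `A` and `beta` are invented for illustration and are not part of the dataset analysed in this notebook.

```python
import numpy as np

# Synthetic data (illustrative only): y = 2 + 3x exactly, no noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Design matrix with a column of ones for the intercept
A = np.column_stack([np.ones_like(x), x])

# Least-squares solution: minimizes the sum of squared residuals ||A @ beta - y||^2
beta, residuals, rank, _ = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = beta

print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

Because the synthetic points lie exactly on a line, the fit recovers that line; with noisy data the same call returns the line with the smallest sum of squared residuals.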

In [7]:
# Import the required modules
import seaborn as sns
from IPython.display import display, HTML
import matplotlib.pyplot as plt
import scipy.stats as stats
import pandas as pd
import numpy as np

# ttest and euclidean distance
from scipy.stats import ttest_ind
from scipy.spatial.distance import seuclidean

# linear fit using statsmodels
import statsmodels.api as sm
import statsmodels.formula.api as smf

# good ole sklearn
from sklearn.metrics import euclidean_distances, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split, cross_val_predict, KFold
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

Loading the data

In [8]:


# Small adjustments to default style of plots, making sure it's readable and colorblind-friendly everywhere
plt.style.use('seaborn-colorblind')
plt.rcParams.update({'font.size' : 12.5,
                     'figure.figsize':(10,7)})

# Copy the path of the sample quotes (too big to put in the git repository)
#ALEX: 'C:/Users/alexb/Documents/Ecole/EPFL/MasterII/ADA/'
#JULES: ...
#MARIN: ...
#NICO: ...


path_2_data = 'C:/Users/alexb/Documents/Ecole/EPFL/MasterII/ADA/'


#import the dataset sample
df = pd.read_json(path_2_data + 'polUS_quotes_speakers_merged.json.bz2',compression="bz2",lines=True)
df2 = pd.read_json(path_2_data + 'Speakers_aggregation.json.bz2',compression="bz2",lines=True)
df3 = pd.read_json(path_2_data + 'df_quotes_pol_all_classified.json.bz2',compression="bz2",lines=True)


In [9]:
df3.columns


Index(['quoteID', 'quotation', 'speaker', 'qid_unique', 'date', 'urls', 'p1',
       'p2', 'delta_p', 'year', 'label', 'aliases', 'date_of_birth',
       'nationality', 'gender', 'lastrevid', 'ethnic_group',
       'US_congress_bio_ID', 'occupation', 'party', 'academic_degree', 'id',
       'candidacy', 'type', 'religion', 'age', 'bi_party',
       'colloquial_NaiveBayes', 'colloquial_contractions'],
      dtype='object')

This table might be worth inserting in our data story.

In [10]:
df3

Unnamed: 0,quoteID,quotation,speaker,qid_unique,date,urls,p1,p2,delta_p,year,...,party,academic_degree,id,candidacy,type,religion,age,bi_party,colloquial_NaiveBayes,colloquial_contractions
0,2015-04-16-012993,"Come, Son, Let Me Tell You A Lie,",Robert Johnson,Q16215328,2015-04-16 12:17:09,[http://mysanantonio.com/entertainment/article...,0.6719,0.2895,0.3824,2015,...,[Q29552],,Q16215328,,item,,46.0,Democrat,informal,0
1,2015-05-09-008197,Everybody grieves in their own way. I just wan...,Robert Johnson,Q16215328,2015-05-09 09:25:48,[http://www.foxbaltimore.com/news/features/top...,0.6130,0.2622,0.3508,2015,...,[Q29552],,Q16215328,,item,,46.0,Democrat,informal,0
2,2015-07-31-011595,But a vast minority of sex assault victims are...,Robert Johnson,Q16215328,2015-07-31 15:21:13,[http://pix11.com/2015/07/31/men-convicted-of-...,0.9152,0.0848,0.8304,2015,...,[Q29552],,Q16215328,,item,,46.0,Democrat,informal,0
3,2015-11-05-133940,We are running major law offices with major re...,Robert Johnson,Q16215328,2015-11-05 07:30:00,[http://www.nydailynews.com/new-york/exclusive...,0.7218,0.2288,0.4930,2015,...,[Q29552],,Q16215328,,item,,46.0,Democrat,informal,0
4,2015-12-23-026592,"I did footings, I did walls. Right now, I'm up...",Robert Johnson,Q16215328,2015-12-23 14:38:25,[http://www.mprnews.org/story/2015/10/23/vikes...,0.7711,0.1818,0.5893,2015,...,[Q29552],,Q16215328,,item,,46.0,Democrat,informal,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564653,2020-03-12-051570,People in this area are very in tune with what...,Marge Anderson,Q6760269,2020-03-12 04:29:22,[https://www.recorder.com/b1-Northfield-trash-...,0.8344,0.1656,0.6688,2020,...,,,Q6760269,,item,,89.0,,informal,1
564654,2020-03-01-037715,"to help spread facts, not fear.",Sabrina Cervantes,Q27890015,2020-03-01 01:42:05,[https://mynewsla.com/business/2020/02/29/offi...,0.8328,0.1561,0.6767,2020,...,[Q5020399],,Q27890015,,item,,34.0,,informal,0
564655,2020-03-21-041323,We knew that we needed to do more to protect o...,Luis Alejo,Q6700297,2020-03-21 19:42:48,[https://www.montereyherald.com/2020/03/21/cov...,0.8260,0.1582,0.6678,2020,...,[Q29552],,Q6700297,,item,,47.0,Democrat,informal,0
564656,2020-01-28-083318,the Sheriff has enriched himself,Joel Robideaux,Q6213896,2020-01-28 00:28:26,[https://www.katc.com/news/lafayette-parish/lp...,0.7743,0.2182,0.5561,2020,...,[Q29468],,Q6213896,,item,,59.0,Republican,informal,0


In [11]:
df_pre_reg2 = df3[['bi_party','gender','age','colloquial_NaiveBayes', 'colloquial_contractions']]

In [12]:
df_pre_reg2 = df_pre_reg2.replace(to_replace = 'None', value=np.nan).dropna()
print(len(df_pre_reg2))

512074


In [13]:
# Keep the first listed gender QID of each speaker
df_pre_reg2['gender_all'] = df_pre_reg2['gender'].apply(lambda x: x[0])

# Wikidata QIDs of the gender values to discard
# (the analysis keeps only 'Q6581072' and 'Q6581097')
genders_to_drop = ['Q1052281', 'Q48279', 'Q48270', 'Q27679766', 'Q18116794',
                   'Q15145779', 'Q15145778', 'Q12964198', 'Q1097630', 'Q2449503']

# Drop every row whose gender matches one of these QIDs
mask_drop = df_pre_reg2['gender_all'].str.contains('|'.join(genders_to_drop))
df_pre_reg2 = df_pre_reg2.drop(df_pre_reg2.index[mask_drop])



To avoid too many features, the analysis focuses only on the male and female genders. Only a few rows are discarded by this filter: the row count printed in this notebook goes from 512074 before the filter to 512017 afterwards.

In [14]:
df_pre_reg2['gender'] = df_pre_reg2['gender'].apply(lambda x: x[0])
df_pre_reg2.drop('gender_all', axis=1, inplace=True)
df_pre_reg2['Index'] = df_pre_reg2.index
df_pre_reg2


Unnamed: 0,bi_party,gender,age,colloquial_NaiveBayes,colloquial_contractions,Index
0,Democrat,Q6581097,46.0,informal,0,0
1,Democrat,Q6581097,46.0,informal,0,1
2,Democrat,Q6581097,46.0,informal,0,2
3,Democrat,Q6581097,46.0,informal,0,3
4,Democrat,Q6581097,46.0,informal,0,4
...,...,...,...,...,...,...
564651,Republican,Q6581097,75.0,informal,0,564651
564652,Republican,Q6581097,64.0,informal,0,564652
564655,Democrat,Q6581097,47.0,informal,0,564655
564656,Republican,Q6581097,59.0,informal,0,564656


In [19]:
# Encode the label as 1 (informal) / 0 (formal); .loc avoids chained assignment
df_pre_reg2.loc[df_pre_reg2['colloquial_NaiveBayes'] == 'informal', 'colloquial_NaiveBayes'] = 1
df_pre_reg2.loc[df_pre_reg2['colloquial_NaiveBayes'] == 'formal', 'colloquial_NaiveBayes'] = 0
df_pre_reg2


Unnamed: 0,bi_party,gender,age,colloquial_NaiveBayes,colloquial_contractions,Index
0,Democrat,Q6581097,0.284091,1,0,0
1,Democrat,Q6581097,0.284091,1,0,1
2,Democrat,Q6581097,0.284091,1,0,2
3,Democrat,Q6581097,0.284091,1,0,3
4,Democrat,Q6581097,0.284091,1,0,4
...,...,...,...,...,...,...
564651,Republican,Q6581097,0.613636,1,0,564651
564652,Republican,Q6581097,0.488636,1,0,564652
564655,Democrat,Q6581097,0.295455,1,0,564655
564656,Republican,Q6581097,0.431818,1,0,564656


Normalization of the feature "age"

In [20]:
df_pre_reg2['age'] = (df_pre_reg2['age']-df_pre_reg2['age'].min())/(df_pre_reg2['age'].max()-df_pre_reg2['age'].min())
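
The expression above is min-max scaling: the smallest age maps to 0, the largest to 1, and everything else falls linearly in between. A small sketch on hypothetical ages (the values and the names `ages` and `ages_scaled` are invented for illustration):

```python
import pandas as pd

# Hypothetical ages, for illustration only
ages = pd.Series([21.0, 46.0, 75.0, 109.0])

# Min-max scaling: subtract the minimum, then divide by the range
ages_scaled = (ages - ages.min()) / (ages.max() - ages.min())

print(ages_scaled.round(6).tolist())
```

Applying the same expression a second time leaves the column unchanged, since the minimum is then 0 and the range is 1, so re-running the cell is harmless.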

Defining the features and outcomes

In [21]:
df_feat = df_pre_reg2[['bi_party','gender','age']]
df_score_nb = df_pre_reg2['colloquial_NaiveBayes']
df_score_contr = df_pre_reg2['colloquial_contractions']

Create one-hot encoding for gender and party

In [22]:
#One-Hot Encoding to represent categorical variables as binary vectors
onehot = pd.get_dummies(df_feat[['bi_party','gender']]).add_suffix('_onehot')
df_feat_os=pd.merge(df_feat,
             onehot,
             left_index=True,
             right_index=True)

# Remove the original categorical columns, now replaced by their one-hot versions
df_feat_os = df_feat_os.drop(columns=['bi_party', 'gender'])
# Check the result
df_feat_os.head()

Unnamed: 0,age,bi_party_Democrat_onehot,bi_party_Republican_onehot,gender_Q6581072_onehot,gender_Q6581097_onehot
0,0.284091,1,0,0,1
1,0.284091,1,0,0,1
2,0.284091,1,0,0,1
3,0.284091,1,0,0,1
4,0.284091,1,0,0,1
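
To make the encoding step concrete, here is a small sketch of `pd.get_dummies` on a toy frame. The rows are hypothetical and the names `toy` and `toy_onehot` are introduced only for this example; the column names mirror the ones used above.

```python
import pandas as pd

# Toy frame with hypothetical rows, mirroring the two categorical columns
toy = pd.DataFrame({
    "bi_party": ["Democrat", "Republican", "Democrat"],
    "gender": ["Q6581097", "Q6581072", "Q6581097"],
})

# One indicator column per category; the cast to int gives 0/1 values
# regardless of the pandas version (recent versions return booleans)
toy_onehot = pd.get_dummies(toy).astype(int).add_suffix("_onehot")

print(toy_onehot.columns.tolist())
print(toy_onehot.to_numpy())
```

Each pair of indicator columns sums to 1 in every row, so one column of each pair is redundant once the model has an intercept. Passing `drop_first=True` to `pd.get_dummies` is one way to remove that redundancy, if desired.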


In [23]:
# Convert the selected columns of the dataframe into a NumPy array:
def numpy_helper(df, cols):
    return df[cols].to_numpy()

In [34]:
# Test the function numpy_helper:
cols = df_feat_os.columns
test_helper = numpy_helper(df_feat_os, cols)
assert test_helper.shape == (len(df_pre_reg2), len(cols))

# Quick check of the output dimensions:
print('The dataframe of dimensions [{},{}] has been converted into a numpy array of dimensions [{}]'.format(len(df_pre_reg2),len(cols),test_helper.shape))

The dataframe of dimensions [512017,5] has been converted into a numpy array of dimensions [(512017, 5)]


In [35]:
X_reg = numpy_helper(df_feat_os, cols)

# Create y for the Naive Bayes label
y_reg_nb = df_score_nb.to_numpy()

#Split the dataset
X_train_nb, X_test_nb, y_train_nb, y_test_nb = train_test_split(X_reg,
                                                                y_reg_nb,
                                                                test_size = 0.3,
                                                                train_size = 0.7,
                                                                random_state = 123)

#Create Y contractions
y_reg_contr = df_score_contr.to_numpy()

#Split the dataset
X_train_contr, X_test_contr, y_train_contr, y_test_contr = train_test_split(X_reg,
                                                                            y_reg_contr,
                                                                            test_size = 0.3,
                                                                            train_size = 0.7,
                                                                            random_state = 123)
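
Both calls above use the same `X_reg`, the same split sizes and the same `random_state`, so they should select the same rows for training and testing; only the target differs. A minimal sketch of that behaviour on toy arrays (all names and values here are invented for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (illustrative only): 10 samples, 2 features, two different targets
X_toy = np.arange(20).reshape(10, 2)
y_a = np.arange(10)
y_b = np.arange(10) * 10

# 70/30 split; fixing random_state makes the shuffle reproducible
X_tr_a, X_te_a, y_tr_a, y_te_a = train_test_split(X_toy, y_a, test_size=0.3, random_state=123)
X_tr_b, X_te_b, y_tr_b, y_te_b = train_test_split(X_toy, y_b, test_size=0.3, random_state=123)

print(X_tr_a.shape, X_te_a.shape)
# Same number of samples and same seed give the same row partition in both calls
print(np.array_equal(X_tr_a, X_tr_b))
```

Since the partitions coincide, a single call such as `train_test_split(X, y_1, y_2, ...)` would be an equivalent and shorter way to split several targets at once.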

Creation and test of the model for the Naive Bayes label

In [36]:
# Create the model
lin_reg_nb = LinearRegression()

# Train the model on the training set
lin_reg_nb.fit(X_train_nb, y_train_nb)

# Predict on the test set
y_predict_nb=lin_reg_nb.predict(X_test_nb)



Creation and test of the model for contractions

In [37]:
# Create the model
lin_reg_contr = LinearRegression()

# Train the model on the training set
lin_reg_contr.fit(X_train_contr, y_train_contr)

# Predict on the test set
y_predict_contr=lin_reg_contr.predict(X_test_contr)



In [38]:
# Compute the R2 score of each model on its test set:
print('R2 score, linear regression on the Naive Bayes label: {:.2}'.format(lin_reg_nb.score(X_test_nb,y_test_nb)))

print('R2 score, linear regression on the contractions label: {:.2}'.format(lin_reg_contr.score(X_test_contr,y_test_contr)))

R2 score, linear regression on the Naive Bayes label: 0.00071
R2 score, linear regression on the contractions label: 0.00036
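
`LinearRegression.score` returns the coefficient of determination R2, which compares the model's squared error with that of a constant prediction equal to the mean of the targets. Scores of 0.00071 and 0.00036 therefore indicate that party, gender and age explain less than 0.1% of the variance of either label on the test set. The sketch below computes R2 by hand on toy labels; the helper `r2` and the data are hypothetical and only illustrate the definition.

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Toy binary labels (illustrative only)
y_true = np.array([0, 1, 1, 0, 1])

# Baseline that always predicts the mean of the labels
baseline = np.full(y_true.shape, y_true.mean())

print(r2(y_true, baseline))  # constant mean prediction
print(r2(y_true, y_true))    # perfect prediction
```

A model scoring close to the mean-prediction baseline, as both regressions do here, carries almost no predictive information about the label.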


In [30]:

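# Collect the fitted coefficients next to their feature names.
# Assumption: df_coeff refers to the Naive Bayes model and df_coeff2 to the
# contractions model; both models use the same feature columns (cols).
df_coeff = pd.DataFrame({'feature': cols, 'coefficient': lin_reg_nb.coef_})
df_coeff2 = pd.DataFrame({'feature': cols, 'coefficient': lin_reg_contr.coef_})
df_coeff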