<a href="https://www.kaggle.com/code/doudouba/elections-predictions-in-czech-republic-v0105?scriptVersionId=94445719" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [None]:
import datetime
from tqdm.notebook import tqdm
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
import pandas as pd
import os
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import GridSearchCV
import matplotlib.pyplot as plt

In [None]:
# set width of Jupyter notebook
from IPython.display import HTML, display
display(HTML("<style>.container { width:70% !important; }</style>"))

# set some visual properties of displaying pandas DataFrame
pd.options.display.max_columns=200
pd.options.display.max_rows=200

In [None]:
# Contents
# 1 Introduction
# 1.1 Assignment
# 1.2 Methodology
# 1.3 Tools
# 2 Business Understanding
# 2.1 Scenario
# 2.2 Data collection
# 3 Data Understanding
# 3.1 Raw data description
# 4 Data Preparation
# 4.1 Data Preprocessing - Cleaning - Transformation
# 4.2 Data Exploration
# 4.3 Visualization
# 5 Modelling
# 6 Evaluation
# 7 Deployment
# 8 Conclusion of the first report
# Prague, 11-2021 ©

**1. Introduction**

This report presents a data science project on an elections dataset, following the CRoss-Industry Standard Process for Data Mining (CRISP-DM) methodology.


We will try to answer the following questions: <br>

If there were more people with a college degree in town T, how would it affect the result
for party P?<br>
Will town S have a poll turnout above the state/region average?<br>
Which parties compete for the same voters?<br>
Which party changed the structure of its electorate the most from 2013 to 2017?<br>

In [None]:
# setup
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import cross_val_score
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.set_option("display.precision", 2)
plt.rcParams['figure.figsize'] = [8, 6]

# 2/ Get the data

    2.1/ load the data 
    2.2/ 
    2.3/ Take a Quick Look at the Data Structure!
    2.4/ Create a Test Set!!! (Sampling) 

**2. Data**

The data preparation step includes preprocessing, cleaning and transforming the raw data into <br>
clean data that will be used as input for our model. <br>
The raw data is loaded into Power BI Desktop with UTF-8 encoding (so that special <br>
characters are read correctly), and the transformation is performed with Power Query. <br>
The raw (dirty) data and the steps applied to it are as follows: <br>

• Raw data size: 18 214 KB <br>
• Column types changed <br>
• Raw data dimensions: 152 columns × more than 14 764 rows <br>
• Long header names renamed <br>
• Unnecessary columns removed <br>
• Duplicates removed <br>
• Blank rows removed <br>
• Errors removed <br>
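
The cleaning itself was done in Power Query, so the notebook contains no code for it. As a rough illustration only, the same steps could be sketched in pandas as below. The toy table, its column names and the helper `clean_raw` are invented for this example and are not part of the original pipeline.

```python
import pandas as pd

# Toy stand-in for the raw export; every column name here is hypothetical.
raw = pd.DataFrame({
    "Municipality name (long header)": ["A", "A", "B", None],
    "Votes for party 1 in 2017": ["10", "10", "x", None],
    "unused": [1, 1, 2, None],
})

def clean_raw(df):
    """Rename long headers, drop unused columns, duplicates, blank rows and type errors."""
    out = df.rename(columns={
        "Municipality name (long header)": "municipality_name",
        "Votes for party 1 in 2017": "P1_17",
    })
    out = out.drop(columns=["unused"])   # remove unnecessary columns
    out = out.drop_duplicates()          # remove duplicates
    out = out.dropna(how="all")          # remove blank rows
    # change type: values that cannot be parsed become NaN, i.e. "errors"
    out = out.assign(P1_17=pd.to_numeric(out["P1_17"], errors="coerce"))
    out = out.dropna(subset=["P1_17"])   # remove errors
    return out.reset_index(drop=True)

cleaned = clean_raw(raw)
print(cleaned)
```

Only the row for municipality "A" survives here: its duplicate, the blank row and the row with the unparsable value are dropped.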

In [None]:
#%pip install openpyxl
#%pip install xlrd 

In [None]:
df1= pd.read_csv('../input/elections/volby.csv',encoding='latin-1')

In [None]:
df1.head()

In [None]:
df1.shape

In [None]:
!pip install openpyxl

In [None]:
data = pd.read_excel('../input/clean-data/clean_data.xlsx')

In [None]:
data.head()

In [None]:
#!pip install google.colab

In [None]:
data=pd.DataFrame(data)

In [None]:
!pip install google.colab

In [None]:

from google.colab import data_table

data_table.enable_dataframe_formatter()

In [None]:
data=pd.DataFrame(data)

In [None]:
data.head()

In [None]:
from google.colab import data_table
data_table.enable_dataframe_formatter()
data[["Region"]]

In [None]:
import ipywidgets as widgets
# one text widget per column of the DataFrame, each shown in its own tab
tab_contents = data
Region = [widgets.Text(description=name) for name in tab_contents]
tab = widgets.Tab()
tab.children = Region
for ii in range(len(Region)):
    tab.set_title(ii, f"tab_{ii}")
tab

In [None]:
# summary of column dtypes and non-null counts
data.info()

In [None]:
data.describe()

In [None]:
def mean_target_encoding(dt, predictor, target, alpha = 0.01):
    total_cnt = len(dt)
    total_dr = np.mean(dt[target])
    dt_grp = dt.groupby(predictor).agg(
        categ_dr = (target, np.mean),
        categ_cnt = (target, len)
    )
    
    dt_grp['categ_freq'] = dt_grp['categ_cnt'] / total_cnt
    dt_grp['categ_encoding'] = (dt_grp['categ_freq'] * dt_grp['categ_dr'] + alpha * total_dr) / (dt_grp['categ_freq'] + alpha)
    
    return dt_grp[['categ_encoding']].to_dict()['categ_encoding']
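
To make the smoothing in `mean_target_encoding` concrete, here is a small worked sketch of the same formula, encoding = (freq × category mean + alpha × global mean) / (freq + alpha), on a toy table. The data and the column names `Region` and `target` are made up for illustration.

```python
import pandas as pd

# Toy data: category A has 3 rows (mean 2/3), category B has 1 row (mean 0).
toy = pd.DataFrame({
    "Region": ["A", "A", "A", "B"],
    "target": [1.0, 1.0, 0.0, 0.0],
})
alpha = 0.01
global_mean = toy["target"].mean()

stats = toy.groupby("Region")["target"].agg(["mean", "count"])
stats["freq"] = stats["count"] / len(toy)
# Smoothed encoding: each category mean is pulled towards the global mean.
stats["encoding"] = (
    stats["freq"] * stats["mean"] + alpha * global_mean
) / (stats["freq"] + alpha)
print(stats)
```

The rare category B is pulled further towards the global mean than the frequent category A, which is the purpose of the `alpha` term.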

In [None]:
# Print some numbers about data sample size
print(f'Number of rows:   {data.shape[0]:,}'.replace(',', ' '))
print(f'Number of unique indexes:   {data.index.nunique():,}'.replace(',', ' '))
print(f'Number of columns:   {data.shape[1]:,}'.replace(',', ' '))

In [None]:
# define list of predictors
cols_pred = list(data.columns[1:-4])
# define list of numerical predictors
cols_pred_num = [col for col in cols_pred if data[col].dtype != 'O']
# define list of categorical predictors
cols_pred_cat = [col for col in cols_pred if data[col].dtype == 'O']

print('Numerical predictors:')
print('---------------------')
print(data[cols_pred_num].dtypes)
print()
print('Categorical predictors:')
print('-----------------------')
print(data[cols_pred_cat].dtypes)

In [None]:
# Split the data in train & test
data_train, data_test = train_test_split(data,test_size=0.2,random_state=2)

In [None]:
import pandas as pd
pd.options.mode.chained_assignment = None  # default='warn'

In [None]:
# Drop variables that have more than 70% missing values 
total = data_train.isnull().sum().sort_values(ascending = False)
percent = (data_train.isnull().sum()/data_train.isnull().count()*100).sort_values(ascending = False)
missing_application_train_data  = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
morethan_70perct_missing = missing_application_train_data[missing_application_train_data['Percent'] > 70].index
data_train.drop(columns = morethan_70perct_missing,inplace=True)

In [None]:
# METADEFINITION
#name of the target column
col_target = "Weight"
# define the list of possible predictors 
cols_pred = list(data_train.drop(columns = 'Weight').columns)
# define list of numerical predictors
cols_pred_num = [col for col in cols_pred if data[col].dtype != 'O']
# define list of categorical predictors
cols_pred_cat = [col for col in cols_pred if data[col].dtype == 'O']

In [None]:
# Global mean of the target, used as a fallback for categories without an encoding
total_dr = data_train[col_target].mean()

# Save the mean encoding parameters so that they can be reused on the test set
list_of_mean_encoding = []

# encode categorical predictors
for pred in cols_pred_cat:
    new_vals = mean_target_encoding(
        dt=data_train, 
        predictor=pred, 
        target=col_target
    )
    list_of_mean_encoding.append([pred, new_vals])

    additional_values = set(data_train[data_train[pred].notnull()][pred].unique()) - set(new_vals.keys())
    for p in additional_values:
        new_vals[p] = total_dr

    data_train['MTE_' + pred] = data_train[pred].replace(new_vals)

    if 'MTE_' + pred not in cols_pred:
        cols_pred.append('MTE_' + pred)

    if pred in cols_pred:
        cols_pred.remove(pred)
        
# Drop all the old categorical variables        
data_train.drop(columns = cols_pred_cat, inplace = True)

In [None]:
final_predictors = list(data_train.drop(columns = 'Weight').columns)

In [None]:
# find columns with infinity values
cols_with_inf = []
for col in final_predictors:
    if np.any(np.isinf(data_train[col])):
        cols_with_inf.append(col)
        print(f'Column {col} includes infinity values.')
        
# replace infinity values
for col in cols_with_inf:
    data_train[col] = data_train[col].replace(np.inf, 9999999)
    
    
# find columns with NEGATIVE infinity values
cols_with_neginf = []
for col in final_predictors:
    if np.any(np.isneginf(data_train[col])):
        cols_with_neginf.append(col)
        print(f'Column {col} includes negative infinity values.')
        
# replace NEGATIVE infinity values
for col in cols_with_neginf:
    data_train[col] = data_train[col].replace(-np.inf, -9999999)
    
# find columns with NaN values ALL
cols_with_nan = []
for col in final_predictors:
    if np.any(np.isnan(data_train[col])):
        cols_with_nan.append(col)
        print(f'Column {col} includes NaN values.')
        
        
# replace NaN values
for col in cols_with_nan:
    data_train[col] = data_train[col].replace(np.nan, 0)

In [None]:
print(f'{data_train.shape[0]} rows and {data_train.shape[1]} columns')
# Check each column for missing values
print(f'The train set has {data_train.isna().any().sum()} columns with missing values')

In [None]:
y_train = data_train['Weight']
X_train =data_train.drop(columns = 'Weight')

In [None]:
X_train.head()

In [None]:
y_train.head()

# Best Features

In [None]:
import os
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import SelectKBest, chi2

In [None]:
# apply SelectKBest to score the features (chi2 requires non-negative feature values)
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X_train, y_train)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X_train.columns)

# concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Specs', 'Score']
print(featureScores.nlargest(30, 'Score'))

In [None]:
%%time

from sklearn.manifold import TSNE
tsne = TSNE(random_state=17)

# project the training predictors to 2D and colour the points by the target
X_tsne = tsne.fit_transform(X_train)

plt.figure(figsize=(12,10))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_train, 
            edgecolor='none', alpha=0.7, s=40,
            cmap='nipy_spectral')
plt.colorbar()
plt.title('t-SNE projection of the training set');

In [None]:
# Check the histogram  of both train and test sets


In [None]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(max_depth=2, random_state=5, warm_start=True, n_jobs=-1 )
# scikit-learn >= 1.2 names this argument `estimator`; older releases call it `base_estimator`
ada = AdaBoostClassifier(estimator=clf, n_estimators=700, learning_rate = .1)
ada.fit(X_train,y_train)


How many rows and columns does the dataframe have?
Answer: 14764 rows and 88 columns.

Hypothesis: let's explore our dataset and check whether the variables are on the same scale.
Conclusion: our variables are not on the same scale (difference between means = 37, difference between maxima = 602), so we standardize them (centre and reduce) to bring them to the same scale.

In [None]:
data[["P1_17","University"]].describe() 

**3. Methodology**

Importing packages for PCA and scaling (centring and reducing)

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

In [None]:
data.columns

In [None]:
import re 
df=data

We are selecting the dataset to use for PCA 

In [None]:
col_val_num=re.findall('(P[0-9_]+)|(edu_level_[0-9]+)'," ".join(df.columns.tolist()))

#col_val_num=re.findall('P[0-9_]+'," ".join(df.columns.tolist()))
col_val_num=[i[0]+i[1] for i in col_val_num]
col_val_num.extend(["Catholics", "University","Employed","Unemployed"])
col_val_num[:5]

In [None]:
df[["Employed", "Unemployed"]].head()

Before that, we replace the NaN values with the mean of each column.

In [None]:
X=df[col_val_num]
X=X.fillna(X.mean())

Let's proceed with the scaling.

In [None]:
X_cr=scale(X)
X_cr

Here we apply PCA with 2 components (2 axes).

In [None]:
pca=PCA(n_components=2)
X=pca.fit_transform(X_cr)

In [None]:
print(pca.explained_variance_ratio_)
print(pca.singular_values_)

The two components together explain about 39% of the variance, which is not good: we were aiming for at least 80%, the maximum (and best) value being 100%.
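
One way to see how many components an 80% target would require is to fit a full PCA and look at the cumulative explained variance. The sketch below uses synthetic data (`X_demo` is invented for the example), so the numbers it prints say nothing about the elections dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale

# Synthetic stand-in for the scaled predictors: 10 columns, two of them correlated.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
X_demo[:, 1] = X_demo[:, 0] + 0.1 * rng.normal(size=200)

# Fit PCA with all components and accumulate the explained variance ratios.
pca_full = PCA().fit(scale(X_demo))
cum_var = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 80%.
n_needed = int(np.searchsorted(cum_var, 0.80) + 1)
print(cum_var.round(2))
print("components needed for 80%:", n_needed)
```

scikit-learn can also pick this number directly when `n_components` is given as a fraction, for example `PCA(n_components=0.80)`.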

In [None]:
#Let's look at the shape of our new data set
X.shape

Let's plot the PCA result to see the variance.

In [None]:
x=[i[0] for i in X]
y=[i[1] for i in X]

In [None]:
plt.plot(X)

In [None]:
plt.plot(x,y, "*") 

Result: we notice one compact group and some noisy points. We suggest removing the noise and then proceeding with clustering.

In [None]:
#Isolation forest to remove the noise 
from sklearn.ensemble import IsolationForest
clf = IsolationForest(random_state=0).fit_predict(X_cr)
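
For readers unfamiliar with `fit_predict` on an `IsolationForest`: it returns 1 for points considered normal and -1 for points considered outliers. A minimal sketch on synthetic 2D points (the arrays `inliers`, `outliers` and `pts` are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# A compact Gaussian cloud plus two points placed far away from it.
rng = np.random.default_rng(0)
inliers = rng.normal(0, 1, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.0]])
pts = np.vstack([inliers, outliers])

# fit_predict returns +1 for inliers and -1 for outliers.
labels = IsolationForest(random_state=0).fit_predict(pts)

# Keep only the rows labelled as inliers.
mask = labels == 1
kept = pts[mask]
print("flagged as outliers:", int((labels == -1).sum()))
print("rows kept:", len(kept))
```

With the default `contamination='auto'`, some points on the edge of the cloud may be flagged as well, so the number of removed rows is usually larger than the number of obvious outliers.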

In [None]:
import ipywidgets as widgets
from IPython.display import display

In [None]:
import datetime
import ipywidgets as ipyw

from bokeh.models.widgets.inputs import AutocompleteInput
from IPython.display import display

In [None]:
# index labels of the noisy rows to remove (IsolationForest marks outliers with -1)
index_to_remove = df.index[clf == -1].tolist()

ipyw.Dropdown(options =index_to_remove)

With the following call we remove the noisy rows from our dataset; we then repeat the PCA and proceed with clustering.

In [None]:
df.drop(index_to_remove, axis=0, inplace=True)

In [None]:
X=df[col_val_num]
X=X.fillna(X.mean())
X_cr=scale(X)
X_cr

In [None]:
pca=PCA(n_components=2)
X=pca.fit_transform(X_cr)
print(pca.explained_variance_ratio_)
print(pca.singular_values_)

In [None]:
x=[i[0] for i in X]
y=[i[1] for i in X]
plt.plot(X)

In [None]:
plt.plot(x,y, "*") 

We now proceed with hierarchical agglomerative clustering (CAH) using the Ward method and the Euclidean metric. We start with a threshold of 0 and will increase it depending on the result.

In [None]:
# libraries for hierarchical agglomerative clustering (CAH)
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
# build the linkage matrix
Z = linkage(X_cr,method='ward',metric='euclidean')
# display the dendrogram
plt.title("CAH")
dendrogram(Z,labels=df.index,orientation='top',color_threshold=0)
plt.show()

# work on methods and metrics
# change the threshold accordingly
# with a threshold of 0 we notice ...

We use KMeans with the silhouette score to decide how many clusters we should use.

In [None]:
# libraries to evaluate the partitions
from sklearn import metrics, cluster
# use the "silhouette" metric
# vary the number of clusters from 2 to 6
res = np.arange(5,dtype="double")
for k in np.arange(5):
    km = cluster.KMeans(n_clusters=k+2)
    km.fit(X_cr)
    res[k] = metrics.silhouette_score(X_cr,km.labels_)
print(res)
# plot
import matplotlib.pyplot as plt
plt.title("Silhouette")
plt.xlabel("# of clusters")
plt.plot(np.arange(2,7,1),res)
plt.show()

The results show that we should use 2 clusters, because this gives the highest silhouette value.

In [None]:
kmeans = cluster.KMeans(n_clusters=2)
kmeans.fit(X_cr)
# indices sorted by group
idk = np.argsort(kmeans.labels_)
# display the observations and their groups
print(pd.DataFrame(df.index[idk],kmeans.labels_[idk]))
# distances of the observations to the cluster centres
print(kmeans.transform(X_cr))
# correspondence with the CAH groups
#pd.crosstab(groupes_cah,kmeans.labels_)

Then we set the threshold to 250.

In [None]:
# libraries for hierarchical agglomerative clustering (CAH)
from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage
# build the linkage matrix
Z = linkage(X_cr,method='ward',metric='euclidean')
# display the dendrogram
plt.title("CAH")
dendrogram(Z,labels=df.index,orientation='top',color_threshold=250)
plt.show()

# work on methods and metrics
# change the threshold accordingly

In [None]:
# number of noisy rows we removed
len(index_to_remove)

Let's look at how many groups we have.

In [None]:
from scipy.cluster.hierarchy import fcluster
groupes_cah = fcluster(Z,t=150,criterion='distance')
print(groupes_cah)

The results show that we have 7 groups, which we will try to characterize.

In [None]:
np.unique(groupes_cah)
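
Besides the list of labels, it is often useful to know how many observations fall into each group. A small sketch of cutting a Ward dendrogram at a distance threshold and counting group sizes, on three synthetic blobs (the data and the threshold `t=5` are chosen for this toy example only):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Three tight, well separated blobs of 20 points each.
rng = np.random.default_rng(0)
blobs = np.vstack([rng.normal(c, 0.1, size=(20, 2)) for c in (0.0, 5.0, 10.0)])

# Ward linkage, then cut the tree at a fixed distance as in the notebook.
Z_demo = linkage(blobs, method="ward", metric="euclidean")
labels_demo = fcluster(Z_demo, t=5, criterion="distance")

# Group labels and the number of observations in each group.
ids, counts = np.unique(labels_demo, return_counts=True)
print(dict(zip(ids.tolist(), counts.tolist())))
```

On the real data, the same `np.unique(groupes_cah, return_counts=True)` call would show whether the groups are balanced or dominated by one large cluster.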


In [None]:
df.head()

Let's check that there are no duplicates in our ID_N column.
Result: no duplicates found.

In [None]:
len(np.unique(df[["ID_N"]]))
len(np.unique(df[["ID_N"]]))/len(df[["ID_N"]]) #no duplicated ID good

Let's retrieve our groups: we now have 9 groups stored as lists.

In [None]:

import collections as clt 

# map each cluster label to the list of ID_N values it contains
dict_group=clt.defaultdict(list)

for g, idn in zip(groupes_cah, df["ID_N"]):
    dict_group[g].append(int(idn))
    


In [None]:
ipyw.Dropdown(options =dict_group[8])

With dict_group[n] we collected all the ID_N values and grouped them by cluster. We now build, for each cluster, the list of the corresponding municipality names.

In [None]:
dict_villes=clt.defaultdict(list)
for i in range(1,8):
    dict_villes[i]= np.unique(df[df["ID_N"].isin(dict_group[i])]['municipality_name'].values.tolist())
    

Let's find the intersections between our clusters. For example, for Cluster 1 and Cluster 2 we look at which municipality names they have in common.

In [None]:
# intersection between the municipalities of clusters 1 and 2
intersect=set(dict_villes[1]).intersection(set(dict_villes[2]))
len(intersect)
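
The same comparison can be repeated for every pair of clusters rather than only clusters 1 and 2. A minimal sketch, using a hypothetical dictionary `demo_villes` with the same shape as `dict_villes` (cluster label mapped to a list of municipality names):

```python
from itertools import combinations

# Hypothetical stand-in for dict_villes: cluster label -> municipality names.
demo_villes = {
    1: ["Brno", "Praha"],
    2: ["Praha", "Plzen"],
    3: ["Ostrava"],
}

# Names shared by each pair of clusters.
overlaps = {
    (a, b): sorted(set(demo_villes[a]) & set(demo_villes[b]))
    for a, b in combinations(sorted(demo_villes), 2)
}
for pair, names in overlaps.items():
    print(pair, names)
```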

In [None]:
ipyw.Dropdown(options =intersect)

Now let's try to find which municipalities are in cluster 1 but not in any other cluster.

In [None]:
# what is in group 1 that is not in any of the other groups
others = set().union(*(dict_villes[k] for k in range(2, 8)))
diff = set(dict_villes[1]).difference(others)
len(diff)


In [None]:
ipyw.Dropdown(options =diff)

These municipalities appear only in cluster 1 and in no other cluster, which suggests that they are unlikely to be won by the political parties that dominate the other clusters.

**Now let's try a different approach using correlation and linear regression**

Let's split and reduce our dataset.
Data reduction: split into smaller datasets that are easier to handle.
Data reduction: split into a train set and a test set.

In [None]:
df_region=df.iloc[:,2:5]

df_17 = df.iloc[:, 7:39]

df_13=df.iloc[:,42:68]

df_population=df.iloc[:,69:76]

df_education=df.iloc[:,76:82]

df_religion=df.iloc[:,82:83]

df_employment=df.iloc[:,84:87]

ipyw.Dropdown(options = df_region)

In [None]:
ipyw.Dropdown(options =df_17)

In [None]:
ipyw.Dropdown(options =df_13)

In [None]:
ipyw.Dropdown(options =df_population)

In [None]:
ipyw.Dropdown(options =df_education)

In [None]:
ipyw.Dropdown(options =df_religion)

In [None]:
ipyw.Dropdown(options =df_employment)

To understand the data better, let's look for relationships using the Pearson correlation and display them as a table with a colour gradient background.
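
As a reminder of what the Pearson coefficient measures, the sketch below computes it by hand (covariance divided by the product of the standard deviations) and compares it with `DataFrame.corr`. The two columns are invented toy values; `P_demo` is a hypothetical party share, not a column of the dataset.

```python
import numpy as np
import pandas as pd

# Toy municipality-level values: share of university graduates and a party's vote share.
demo = pd.DataFrame({
    "University": [5.0, 10.0, 15.0, 20.0],
    "P_demo": [12.0, 15.0, 21.0, 24.0],
})

# Pearson r computed manually from the centred columns.
x = demo["University"] - demo["University"].mean()
y = demo["P_demo"] - demo["P_demo"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

# The same value as returned by pandas.
r_pandas = demo.corr(method="pearson").loc["University", "P_demo"]
print(round(r_manual, 4), round(r_pandas, 4))
```

A value close to 1 means the two columns rise together across municipalities; it describes municipalities as a whole, not the behaviour of individual voters.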

In [None]:
import matplotlib.pyplot as plt

Let's look at the relationship between education and the decision to vote for a political party.

In [None]:
df_edu=pd.concat([df_education,df_17], axis=1)

In [None]:
import seaborn as sns
corr_df = df_edu.corr(method='pearson')

In [None]:
corr_df.style.background_gradient(cmap='coolwarm')

The results show that municipalities with more educated people tend to vote for the following political parties: P20, P15, P12, P9 and P1 (shown in red).

Now we explore the relationship between religion and the decision to vote for a political party.

In [None]:

df_relg=pd.concat([df_religion,df_17], axis=1)
import seaborn as sns
corr_df = df_relg.corr(method='pearson')
corr_df.style.background_gradient(cmap='coolwarm')

The results show that Catholics tend to vote for political party P24, which is a Catholic party.

*Now we explore the relationship between sex and the decision to vote for a particular political party*

In [None]:
df_pop=pd.concat([df_population,df_17], axis=1)
corr_df = df_pop.corr(method='pearson')
corr_df.style.background_gradient(cmap='coolwarm')

Result:
Both men and women tend to vote for parties P29, P21, P15, P12, P8, P4 and P1.
Divorced people show the same tendency as men and women.
People over 65 tend to vote for P1, P4 and P21, while Roma people do not seem to have an impact on the election results.

*Relationship between employment status and decision of voting for a particular political party*

In [None]:
df_emp=pd.concat([df_employment,df_17], axis=1)
corr_df = df_emp.corr(method='pearson')
corr_df.style.background_gradient(cmap='coolwarm')

Conclusion:
Retired people with a part-time job tend to vote for P15, P20, P1 and P9, while retired people without a job tend to vote 
for P29, P21, P8 and P4.
Unemployed people tend to vote for P29, P21, P7 and P4.

*Now let's explore the link between political parties*

In [None]:
df_reg=pd.concat([df_region,df_17], axis=1)
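# keep only numeric columns: correlation is not defined for text columns such as names
corr_df = df_reg.select_dtypes('number').corr(method='pearson')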
corr_df.style.background_gradient(cmap='coolwarm')

In [None]:
df_17_13=pd.concat([df_13,df_17], axis=1)
corr_df = df_17_13.corr(method='pearson')
corr_df.style.background_gradient(cmap='coolwarm')

General conclusion:

Looking at the correlations overall, we noticed that P24, P29, P20, P21, P15, P12, P9, P7, P4 and P1 are the major political parties 
and have a chance of winning the elections.
We will therefore focus the rest of the analysis on these political parties.

In [None]:
# Subset the 2013 results as the train set and store it as CSV
df_train=df[['ID_N','Region','P1_13','P4_13','P7_13','P9_13','P12_13','P15_13','P21_13','P24_13', 'University', 'Unemployed', 'Catholics', 'Male','Female']].copy()
df_train.columns=['ID', 'Region', 'P1', 'P4', 'P7', 'P9', 'P12', 'P15', 'P21', 'P24', 'University', 'Unemployed', 'Catholics', 'Male', 'Female']
df_train.to_csv('df_train.csv', index=False)

In [None]:
# Subset the 2017 results as the test set and store it as CSV
df_test=df[['ID_N','Region','P1_17','P4_17','P7_17','P9_17','P12_17','P15_17','P21_17','P24_17','University', 'Unemployed', 'Catholics', 'Male','Female']].copy()
df_test.columns=['ID', 'Region', 'P1', 'P4', 'P7', 'P9', 'P12', 'P15', 'P21', 'P24', 'University', 'Unemployed', 'Catholics', 'Male', 'Female']
df_test.to_csv('df_test.csv', index=False)

Let's check the correlation between the variables of our new dataset.

In [None]:
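# keep only numeric columns: correlation is not defined for text columns such as names
corr_df = df_train.select_dtypes('number').corr(method='pearson')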
corr_df.style.background_gradient(cmap='coolwarm')

Let's move on to modelling.

In [None]:
#setup

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import statsmodels.api as sm
import statsmodels.formula.api as sms

pd.set_option('display.precision',2)
plt.rcParams['figure.figsize'] = [8, 6]


In [None]:
#training data
df_train

In [None]:
# histogram of the P1 results in the train set
dummy = plt.hist(df_train.P1)

In [None]:
#Let's go Modelling using LinearRegression Y = aX + b
X=df_train[['P1', 'P4', 'P7', 'P9', 'P12', 'P15', 'P21', 'P24']]
X = np.nan_to_num(X)
XX=np.repeat(1,len(X)).reshape(-1,1)
X=np.concatenate((XX,X),axis = 1)
y=df_train[['Unemployed']]
y = np.nan_to_num(y) 

In [None]:
XX=np.repeat(1,len(X))
XX.shape

In [None]:
#Fit model

modelA=LinearRegression().fit(X,y)

print("Intercept:", modelA.intercept_)
print("coef_B:", modelA.coef_)

### assess model performance
# i. scoring itself directly (not recommended, overrates performance)
print('R2 on itself: ', modelA.score(X, y))
# ii. scoring by a cross-validation
# https://scikit-learn.org/stable/modules/cross_validation.html
scores = cross_val_score(LinearRegression(), X, y, cv=4)
print('R2 by cval: ', scores)

Evaluating the model on its own training data gave a poor R2 of 0.47.

The cross-validated R2 values gave quite good results.
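
When comparing the two evaluations, it helps to summarize the cross-validation folds with their mean and spread rather than reading the raw array. A hedged sketch on synthetic data (`X_demo` and `y_demo` are generated for illustration and have nothing to do with the election results):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem: three predictors, a linear signal and some noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=200)

# R2 on the data the model was fitted on (optimistic).
model = LinearRegression().fit(X_demo, y_demo)
r2_train = model.score(X_demo, y_demo)

# R2 on held-out folds, summarized by mean and standard deviation.
r2_cv = cross_val_score(LinearRegression(), X_demo, y_demo, cv=4)
print(f"R2 on training data: {r2_train:.2f}")
print(f"R2 by cross-validation: {r2_cv.mean():.2f} +/- {r2_cv.std():.2f}")
```

A cross-validated mean that is clearly higher than the training score would be unusual and worth double-checking, since scoring on the training data normally overrates performance.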

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set(style='white')
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
from sklearn import decomposition
from sklearn import datasets
from mpl_toolkits.mplot3d import Axes3D

# Loading the dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Let's create a beautiful 3d-plot
fig = plt.figure(1, figsize=(6, 5))
plt.clf()
# create the 3D axes through the figure so that they are attached to it
ax = fig.add_axes([0, 0, .95, 1], projection='3d', elev=48, azim=134)

plt.cla()

for name, label in [('Setosa', 0), ('Versicolour', 1), ('Virginica', 2)]:
    ax.text3D(X[y == label, 0].mean(),
              X[y == label, 1].mean() + 1.5,
              X[y == label, 2].mean(), name,
              horizontalalignment='center',
              bbox=dict(alpha=.5, edgecolor='w', facecolor='w'))
# Change the order of labels, so that they match
y_clr = np.choose(y, [1, 2, 0]).astype(float)
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y_clr, 
           cmap=plt.cm.nipy_spectral)

# hide the tick labels on all three axes
ax.xaxis.set_ticklabels([])
ax.yaxis.set_ticklabels([])
ax.zaxis.set_ticklabels([]);

**4. Summary**

If there were more people with a college degree in town T, how would it affect the result
for party P?
 Will town S have a poll turnout above the state/region average?
 Which parties compete for the same voters?
 Which party changed the structure of its electorate the most from 2013 to 2017?

Looking at the correlations overall, we noticed that P24, P29, P20, P21, P15, P12, P9, P7, P4 and P1 are the major political parties and have a chance of winning the elections. We therefore focused the rest of the analysis on these parties.

The results show that Catholics tend to vote for political party P24, which is a Catholic party.

Retired people with a part-time job tend to vote for P15, P20, P1 and P9, while retired people without a job tend to vote for P29, P21, P8 and P4. Unemployed people tend to vote for P29, P21, P7 and P4.

The correlation between the 2013 and 2017 results is very low for political parties P8, P16, P17 and P18, meaning 
that the structure of their electorate probably changed between 2013 and 2017.
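
A compact way to rank parties by how stable their results are is to correlate each party's 2013 column with its 2017 column and sort the values. The sketch below uses synthetic columns and a hypothetical helper `year_to_year_corr`; it assumes that the same code (for example `P1`) refers to the same party in both years, which should be checked against the dataset description.

```python
import numpy as np
import pandas as pd

# Synthetic results: P1 is stable between the two years, P8 is not.
rng = np.random.default_rng(0)
base = rng.uniform(5, 30, size=100)
demo = pd.DataFrame({
    "P1_13": base,
    "P1_17": base + rng.normal(0, 1, size=100),
    "P8_13": rng.uniform(0, 5, size=100),
    "P8_17": rng.uniform(0, 5, size=100),
})

def year_to_year_corr(df, parties, y1="13", y2="17"):
    """Pearson correlation between each party's results in two election years."""
    return pd.Series(
        {p: df[f"{p}_{y1}"].corr(df[f"{p}_{y2}"]) for p in parties}
    ).sort_values()

# Lowest values first: these parties changed the most across municipalities.
stability = year_to_year_corr(demo, ["P1", "P8"])
print(stability)
```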

Both men and women tend to vote for parties P29, P21, P15, P12, P8, P4 and P1.
Divorced people show the same tendency as men and women.
People over 65 tend to vote for P1, P4 and P21, while Roma people do not seem to have an impact on the election results.

The results show that municipalities with more educated people tend to vote for the following political parties: P20, P15, P12, P9 and P1 (shown in red), meaning that if there were more educated people, the parties listed above would be at an advantage.

The results of our clustering show that some municipalities appear only in cluster 1 and in no other cluster, which suggests that they are unlikely to be won by the political parties that dominate the other clusters.



Finally, evaluating our LinearRegression model on its own training data gave a poor R2 of 0.47.

The cross-validated R2 values gave quite good results.