# Data Management and Visualization

***

## Developing a Research Question and Creating Your Personal Code Book

STEP 1: Choose a data set that you would like to work with.

I am choosing the GapMinder dataset.

STEP 2. Identify a specific topic of interest

I am exploring whether there is a relationship between Polity scores and life expectancy.

STEP 3. Prepare a codebook of your own (i.e., print individual pages or copy screen and paste into a new document) from the larger codebook that includes the questions/items/variables that measure your selected topics.

In [4]:
df = pd.read_csv("gapminder.csv")
df.columns

Index(['country', 'incomeperperson', 'alcconsumption', 'armedforcesrate', 'breastcancerper100th', 'co2emissions', 'femaleemployrate', 'hivrate', 'internetuserate', 'lifeexpectancy', 'oilperperson', 'polityscore', 'relectricperperson', 'suicideper100th', 'employrate', 'urbanrate'], dtype='object')

In [5]:
df

Unnamed: 0,country,incomeperperson,alcconsumption,armedforcesrate,breastcancerper100th,co2emissions,femaleemployrate,hivrate,internetuserate,lifeexpectancy,oilperperson,polityscore,relectricperperson,suicideper100th,employrate,urbanrate
0,Afghanistan,,.03,.5696534,26.8,75944000,25.6000003814697,,3.65412162280064,48.673,,0,,6.68438529968262,55.7000007629394,24.04
1,Albania,1914.99655094922,7.29,1.0247361,57.4,223747333.333333,42.0999984741211,,44.9899469578783,76.918,,9,636.341383366604,7.69932985305786,51.4000015258789,46.72
2,Algeria,2231.99333515006,.69,2.306817,23.5,2932108666.66667,31.7000007629394,.1,12.5000733055148,73.131,.42009452521537,2,590.509814347428,4.8487696647644,50.5,65.22
3,Andorra,21943.3398976022,10.17,,,,,,81,,,,,5.36217880249023,,88.92
4,Angola,1381.00426770244,5.57,1.4613288,23.1,248358000,69.4000015258789,2,9.99995388324075,51.093,,-2,172.999227388199,14.5546770095825,75.6999969482422,56.7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
208,Vietnam,722.807558834445,3.91,1.0853671,16.2,1425435000,67.5999984741211,.4,27.8518215557703,75.181,,-7,302.725654656034,11.6533222198486,71,27.84
209,West Bank and Gaza,,,5.9360854,,14241333.3333333,11.3000001907349,,36.4227717919075,72.832,,,,,32,71.9
210,"Yemen, Rep.",610.3573673206,.2,2.3162346,35.1,234864666.666667,20.2999992370605,,12.3497504635596,65.493,,-2,130.05783139719,6.26578903198242,39,30.64
211,Zambia,432.226336974583,3.56,.3413352,13,132025666.666667,53.5,13.5,10.124986462443,49.025,,7,168.623030511023,12.0190362930298,61,35.42


## Data Dictionary

| Field          | Description                                                                           |
|----------------|---------------------------------------------------------------------------------------|
|country | Unique Identifier	|
|incomeperperson |	2010 Gross Domestic Product per capita in constant 2000 US$|
|alcconsumption | 2008 alcohol consumption per adult (age 15+), litres	|
|armedforcesrate |	Armed forces personnel (% of total labor force) |
|breastcancerper100th |	2002 breast cancer new cases per 100,000 female|
|co2emissions | 2006 cumulative CO2 emission (metric tons)	|
|femaleemployrate |	2007 female employees age 15+ (% of population)|
|hivrate |	2009 estimated HIV Prevalence % - (Ages 15-49) |
|internetuserate | 2010 Internet users (per 100 people)	|
|lifeexpectancy | 2011 life expectancy at birth (years)	|
|oilperperson | 2010 oil Consumption per capita (tonnes per year and person)	|
|polityscore | 2009 Democracy score (Polity)	|
|relectricperperson | 2008 residential electricity consumption, per person (kWh)	|
|suicideper100th | 2005 Suicide, age adjusted, per 100 000	|
|employrate |2007 total employees age 15+ (% of population)	|
|urbanrate |2008 urban population (% of total)	|

STEP 4. Identify a second topic that you would like to explore in terms of its association with your original topic

The second topic is whether the employment rate influences the urban rate.

STEP 5. Add questions/items/variables documenting this second topic to your personal codebook
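
One way to record Steps 3 and 5 in code is to keep only the variables behind the two topics. The sketch below assumes the four variables named in Steps 2 and 4 (`polityscore`, `lifeexpectancy`, `employrate`, `urbanrate`). The inline sample rows stand in for `gapminder.csv`, and the names `codebook`, `raw` and `sub` are illustrative only.

```python
import io
import pandas as pd

# A few rows in the layout of gapminder.csv (values copied from the preview above).
sample_csv = io.StringIO(
    "country,lifeexpectancy,polityscore,employrate,urbanrate\n"
    "Afghanistan,48.673,0,55.7000007629394,24.04\n"
    "Albania,76.918,9,51.4000015258789,46.72\n"
    "Andorra,,,,88.92\n"
    "Angola,51.093,-2,75.6999969482422,56.7\n"
)

# Personal codebook: variable name -> description from the data dictionary.
codebook = {
    "polityscore": "2009 Democracy score (Polity)",
    "lifeexpectancy": "2011 life expectancy at birth (years)",
    "employrate": "2007 total employees age 15+ (% of population)",
    "urbanrate": "2008 urban population (% of total)",
}

raw = pd.read_csv(sample_csv)

# Keep the identifier plus the codebook variables, coercing blanks to NaN.
sub = raw[["country", *codebook]].copy()
for col in codebook:
    sub[col] = pd.to_numeric(sub[col], errors="coerce")

print(sub.dtypes)
print(sub.isna().sum())
```

Keeping the descriptions next to the column names makes it easy to print the personal codebook alongside any later table or plot.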

STEP 6. Perform a literature review to see what research has been previously done on this topic. 

Ref 1: Health advocacy with Gapminder animated statistics

Ref 2: Formalizing students’ informal statistical reasoning on real data: Using Gapminder to follow the cycle of inquiry and visual analyses

Ref 3: USE OF TED.COM and GAPMINDER.ORG IN TEACHING APPLICATIONS OF MATHEMATICS AND STATISTICS

STEP 7. Based on your literature review, develop a hypothesis about what you believe the association might be between these topics. Be sure to integrate the specific variables you selected into the hypothesis. 

Suggested hypothesis: Is the suicide rate influenced by the HIV rate?

## Import Libraries

In [1]:
import numpy as np
from numpy import count_nonzero
from numpy import median
from numpy import mean
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import random

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.formula.api import ols

import datetime
from datetime import datetime, timedelta

#import os
#import zipfile
import scipy.stats
from collections import Counter

# import pandas_profiling
# from pandas_profiling import ProfileReport

import sklearn
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.linear_model import LinearRegression, LogisticRegression, ElasticNet, Lasso, Ridge
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score, auc, classification_report, confusion_matrix, f1_score
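try:
    from sklearn.metrics import plot_confusion_matrix, plot_roc_curve
except ImportError:  # both helpers were removed in scikit-learn 1.2
    from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay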

%matplotlib inline
#sets the default autosave frequency in seconds
%autosave 60 
sns.set_style('dark')
sns.set(font_scale=1.2)

plt.rc('axes', titlesize=9)
plt.rc('axes', labelsize=14)
plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)

import warnings
warnings.filterwarnings('ignore')

#Webscraping
#import requests
#from bs4 import BeautifulSoup

# Use Folium library to plot values on a map.
#import folium

# Use Feature-Engine library
#import feature_engine
#import feature_engine.missing_data_imputers as mdi
#from feature_engine.outlier_removers import Winsorizer
#from feature_engine import categorical_encoders as ce
#from feature_engine.discretisation import EqualWidthDiscretiser, EqualFrequencyDiscretiser, DecisionTreeDiscretiser
#from feature_engine.encoding import OrdinalEncoder

pd.set_option('display.max_columns',None)
#pd.set_option('display.max_rows',None)
pd.set_option('display.width', 1000)
pd.set_option('display.float_format','{:.2f}'.format)

random.seed(0)
np.random.seed(0)
np.set_printoptions(suppress=True)

Autosaving every 60 seconds


## Exploratory Data Analysis

In [None]:
df = pd.read_csv("")

In [None]:
df = pd.read_csv("",parse_dates=['Date'])

In [None]:
df

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.columns

### Groupby Function

In [None]:
df.groupby()

In [None]:
df.groupby()

In [None]:
df.groupby()

### Pandas-Profiling Reports

In [None]:
profile = ProfileReport(df=df, title='Name of Report', minimal=True)

In [None]:
profile.to_notebook_iframe()

In [None]:
profile.to_file("your_report.html")

## Data Visualization

### Univariate Data Exploration

In [None]:
df.hist(bins=50, figsize=(20,10))
plt.suptitle('Histogram Feature Distribution', x=0.5, y=1.02, ha='center', fontsize=20)
plt.tight_layout()
plt.show()

In [None]:
df.boxplot(figsize=(20,10))
plt.suptitle('BoxPlots Feature Distribution', x=0.5, y=1.02, ha='center', fontsize=20)
plt.tight_layout()
plt.show()

In [None]:
# Plot 4 rows and 1 column (can be expanded)

fig, ax = plt.subplots(4,1, sharex=False, figsize=(16,16))
fig.suptitle('Main Title')


sns.barplot(x="", y="", data=df, ax=ax[0])
ax[0].set_title('Title of the first chart')
#ax[0].tick_params('x', labelrotation=45)

sns.barplot(x="", y="", data=df, ax=ax[1])
ax[1].set_title('Title of the second chart')
#ax[1].tick_params('x', labelrotation=45)

sns.barplot(x="", y="", data=df, ax=ax[2])
ax[2].set_title('Title of the third chart')
#ax[2].tick_params('x', labelrotation=45)

sns.barplot(x="", y="", data=df, ax=ax[3])
ax[3].set_title('Title of the fourth chart')
#ax[3].tick_params('x', labelrotation=45)

plt.show()

In [None]:
# Plot 1 rows and 2 columns (can be expanded)

fig, ax = plt.subplots(1,2, sharex=False, figsize=(16,5))
fig.suptitle('Main Title')

sns.countplot(x="", data=df, hue=None, ax=ax[0])
ax[0].set_title('Title of the first chart')
ax[0].tick_params('x', labelrotation=45)

sns.countplot(x="", data=df, hue=None, ax=ax[1])
ax[1].set_title('Title of the second chart')
ax[1].tick_params('x', labelrotation=45)

plt.show()

In [None]:
#Plot 2 by 2 subplots

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2,2, sharex=False, figsize=(20,20))
fig.suptitle('Main Title', y=0.5)

sns.countplot(x="", data=df, ax=ax1)
ax1.set_title('Title of the first chart', size=20)
#ax1.tick_params('x', labelrotation=45)


sns.countplot(x="", data=df, ax=ax2)
ax2.set_title('Title of the second chart', size=20)
#ax2.tick_params('x', labelrotation=45)

sns.countplot(x="", data=df, ax=ax3)
ax3.set_title('Title of the third chart', size=20)
#ax3.tick_params('x', labelrotation=45)


sns.countplot(x="", data=df, ax=ax4)
ax4.set_title('Title of the fourth chart', size=20)
#ax4.tick_params('x', labelrotation=45)


plt.show()

In [None]:
fig = plt.figure(figsize=(20,40))

plt.subplot(7,2,1)
plt.title("", size=20)
sns.countplot()

plt.subplot(7,2,2)
plt.title("", size=20)
sns.countplot()

plt.subplot(7,2,3)
plt.title("", size=20)
sns.countplot()

plt.subplot(7,2,4)
plt.title("", size=20)
sns.countplot()

plt.subplot(7,2,5)
plt.title("", size=20)
sns.barplot()

plt.subplot(7,2,6)
plt.title("", size=20)
sns.barplot()

plt.subplot(7,2,7)
plt.title("", size=20)
sns.barplot()

plt.subplot(7,2,8)
plt.title("", size=20)
sns.barplot()

plt.subplot(7,2,9)
plt.title("", size=20)
sns.scatterplot()

plt.subplot(7,2,10)
plt.title("", size=20)
sns.scatterplot()

plt.subplot(7,2,11)
plt.title("", size=20)
sns.scatterplot()

plt.subplot(7,2,12)
plt.title("", size=20)
sns.scatterplot()

plt.subplot(7,2,13)
plt.title("", size=20)
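sns.lineplot()  # axes-level; sns.relplot() is figure-level and would open its own figure

plt.subplot(7,2,14)
plt.title("", size=20)
sns.lineplot()  # axes-level replacement for sns.relplot()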

plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(20,20))


g = sns.catplot(x='', hue = '', row = '',
            kind='count', data=ratings_df,
            height = 3, aspect = 1)

g.set_xlabels("")
g.set_ylabels("")
#g = (g.set_axis_labels("Tip","Total bill(USD)").set(xlim=(0,10),ylim=(0,100)


g.set(xlim=(0,None))
g.set_xticklabels(rotation=90)

plt.suptitle('', x=0.5, y=1.02, ha='center', fontsize=20)

plt.show()

In [None]:
plt.figure(figsize=(20,20))

sns.catplot(x="calories", y="restaurant",

                hue="is_salad", ci=None,

                data=df_calories, color=None, linewidth=3, showfliers = False,

                orient="h", height=20, aspect=1, palette=None,

                kind="box", dodge=True)

plt.xlabel("", size=20)
plt.ylabel("", size=20)
plt.suptitle('', x=0.5, y=1.02, ha='center', fontsize=20)
plt.show()

In [None]:
plt.figure(figsize=(20,20))

sns.relplot(x="age", y="eval", hue="gender",
            row="tenure",
            data=ratings_df, height = 3, aspect = 2)

plt.xlabel("", size=20)
plt.ylabel("", size=20)
plt.suptitle('', x=0.5, y=1.02, ha='center', fontsize=20)
plt.show()

## Time-Series Analysis

In [None]:
timeseries = df[['date','extraction','month', 'day']]

In [None]:
timeseries

In [None]:
timeseries.info()

In [None]:
fig = plt.figure(figsize=(30,10))
sns.lineplot(x=df.date,y=df.amount,data=df, estimator=None)
plt.title("", fontsize=20)
plt.xlabel("", fontsize=20)
plt.ylabel("", fontsize=20)
plt.legend(['',''])
plt.show()

In [None]:
fig = plt.figure(figsize=(30,10))
sns.lineplot(x=df.month,y=df.amount,data=df, estimator=None)
plt.title("", fontsize=20)
plt.xlabel("", fontsize=20)
plt.ylabel("", fontsize=20)
plt.legend(['',''])
plt.show()

In [None]:
fig = plt.figure(figsize=(30,10))
sns.lineplot(x=df.month,y=df.amount,data=df, estimator=None)
plt.title("", fontsize=20)
plt.xlabel("", fontsize=20)
plt.ylabel("", fontsize=20)
plt.legend(['',''])
plt.show()

### Pairplots

In [None]:
# pairplot is figure-level, so the title is set on the figure it returns
g = sns.pairplot(df.sample(min(500, len(df)), random_state=0))
g.fig.suptitle('Pairplots of features', x=0.5, y=1.02, ha='center', fontsize=20)
plt.show()

## Bivariate Data Exploration

In [None]:
sns.jointplot(x='', y='',data=df, kind='scatter')

sns.jointplot(x='', y='',data=df, kind='scatter')

sns.jointplot(x='', y='',data=df, kind='scatter')

sns.jointplot(x='', y='',data=df, kind='scatter')

sns.jointplot(x='', y='',data=df, kind='kde')

sns.jointplot(x='', y='',data=df, kind='kde')

sns.jointplot(x='', y='',data=df, kind='hex')

sns.jointplot(x='', y='',data=df, kind='hex')

sns.jointplot(x='', y='',data=df, kind='reg',scatter_kws={'color':'k'},line_kws={'color':'red'})

sns.jointplot(x='', y='',data=df, kind='reg',scatter_kws={'color':'k'},line_kws={'color':'red'})

sns.lmplot(x='num_items', y='total_value', data=df, scatter_kws={'s': 1, 'alpha': 0.1}, height=5, aspect=1,
           line_kws={'lw': 2, 'color': 'red'})

sns.lmplot(x='num_items', y='total_value', data=df, scatter_kws={'s': 1, 'alpha': 0.1}, height=5, aspect=1,
           line_kws={'lw': 2, 'color': 'red'})

plt.tight_layout()
plt.show()

### Regression plot

In [None]:
line_color = {'color': 'red'}
fig , ax = plt.subplots(2,2, figsize=(20,20))

#Feature

ax1 = sns.regplot(x=X_test.bmi, y=lr_pred, line_kws=line_color, ax=ax[0,0])
ax1.set_xlabel("x")
ax1.set_ylabel("y")
ax1.set_title("Plot 1", size=15)

#Feature

ax2 = sns.regplot(x=X_test.s5, y=lr_pred, line_kws=line_color, ax=ax[0,1])
ax2.set_xlabel("x")
ax2.set_ylabel("y")
ax2.set_title("Plot 2", size=15)

#Feature

ax3 = sns.regplot(x=X_test.bp, y=lr_pred, line_kws=line_color, ax=ax[1,0])
ax3.set_xlabel("x")
ax3.set_ylabel("y")
ax3.set_title("Plot 3", size=15)

#Feature

ax4 = sns.regplot(x=X_test.s4, y=lr_pred, line_kws=line_color, ax=ax[1,1])
ax4.set_xlabel("x")
ax4.set_ylabel("y")
ax4.set_title("Plot 4", size=15)

plt.show()

### FacetGrid

In [None]:
g = sns.FacetGrid(data=df, col="column_name", height=3, aspect=1)
g.map(plt.scatter, "numeric", "numeric")
g.add_legend()
plt.show()

## Geospatial Analysis

In [None]:
mapping = usa_stores[['City','Latitude','Longtitude','Sentiment','Revenue']]
mapping

In [None]:
m = folium.Map(location=[37.090240,-95.712891], zoom_start=5)
m

In [None]:
map_df = pd.DataFrame(mapping.groupby(["City","Latitude","Longtitude"]).mean())
map_df

In [None]:
folium.Marker(location=[33.76,-84.42], popup="Atlanta", tooltip="Sentiment=83.69, Revenue=292.57").add_to(m)
folium.Marker(location=[36.23,-115.27], popup="Las Vegas", tooltip="Sentiment=83.72, Revenue=187.40").add_to(m)
folium.Marker(location=[34.11,-118.41], popup="Los Angeles", tooltip="Sentiment=83.75, Revenue=255.95").add_to(m)
folium.Marker(location=[40.69,-73.92], popup="New York", tooltip="Sentiment=83.71, Revenue=328.38").add_to(m)
folium.Marker(location=[32.83,-117.12], popup="San Diego", tooltip="Sentiment=83.70, Revenue=272.93").add_to(m)

m

In [None]:
m.save("filename.html")

In [None]:
state_geo = "malaysia.geojson"

In [None]:
map2 = folium.Map(location=[4.210484,108.975766], zoom_start=6)

To create a `Choropleth` map, we use the `folium.Choropleth` class with the following main parameters:

1.  `geo_data`, which is the GeoJSON file.
2.  `data`, which is the dataframe containing the data.
3.  `columns`, which lists the dataframe columns used to create the `Choropleth` map: the first holds the region key and the second holds the values to plot.
4.  `key_on`, which is the key in the GeoJSON file that identifies each region. To find it, open the GeoJSON file in any text editor and note the key that holds the region names, since the regions are our variable of interest. In this case, **name** is the key in the GeoJSON file that contains the region names, so `key_on` is `'feature.properties.name'`. Note that this key is case-sensitive, so it must be passed exactly as it appears in the GeoJSON file.
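
Rather than opening the file in a text editor, the property keys can also be listed in code. The snippet below is a sketch on a tiny hand-written GeoJSON object. The two features, their `id` values and the variable names are illustrative assumptions; a real file would be read with `json.load`.

```python
import json

# Minimal stand-in for a GeoJSON file; a real one would be read with json.load(open(path)).
geo = json.loads("""
{
  "type": "FeatureCollection",
  "features": [
    {"type": "Feature", "properties": {"name": "Johor", "id": 1}, "geometry": null},
    {"type": "Feature", "properties": {"name": "Kedah", "id": 2}, "geometry": null}
  ]
}
""")

# Property keys available on the first feature.
property_keys = sorted(geo["features"][0]["properties"])
print(property_keys)

# key_on is written as a path into each feature, with the exact case used in the file.
key_on = "feature.properties.name"

# Region names that the dataframe key column must match exactly.
region_names = [f["properties"]["name"] for f in geo["features"]]
print(region_names)
```

Comparing `region_names` with the key column of the dataframe before plotting helps catch spelling or case mismatches that would otherwise leave regions unshaded.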

In [None]:
folium.Choropleth(geo_data=state_geo, name="choropleth").add_to(map2)

### Correlation

In [None]:
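df.select_dtypes(include='number').corr()  # correlate numeric columns only; text columns such as country raise an error in recent pandas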

In [None]:
df.select_dtypes(include='number').corr()["target"].sort_values()

In [None]:
plt.figure(figsize=(16,9))
sns.heatmap(df.select_dtypes(include='number').corr(),cmap="coolwarm",annot=True,fmt='.2f',linewidths=2)
plt.title("", fontsize=20)
plt.show()

## Hypothesis Testing

The goal of hypothesis testing is to answer the question, “Given a sample and an apparent effect, what is the probability of seeing such an effect by chance?” The first step is to quantify the size of the apparent effect by choosing a test statistic (for example, the t statistic or the F statistic in ANOVA). The next step is to define a null hypothesis, which is a model of the system based on the assumption that the apparent effect is not real. Then compute the p-value, which is the probability of seeing an effect at least as large as the observed one if the null hypothesis is true; it is not the probability that the null hypothesis is true. Finally, interpret the result: if the p-value is low (commonly below 0.05), the effect is said to be statistically significant, which means the data are unlikely under the null hypothesis.
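
The logic of a p-value can be made concrete with a permutation test, which simulates the null hypothesis directly by shuffling group labels. This is an illustrative sketch on synthetic data and is not part of the original analysis; the group sizes, the means, and the helper `permutation_p_value` are assumptions.

```python
import numpy as np

def permutation_p_value(a, b, n_iter=5000, seed=0):
    """Two-sided p-value for a difference in means under label shuffling."""
    rng = np.random.default_rng(seed)
    observed = abs(a.mean() - b.mean())   # test statistic on the real data
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_iter):
        shuffled = rng.permutation(pooled)   # null model: group labels are arbitrary
        diff = abs(shuffled[:len(a)].mean() - shuffled[len(a):].mean())
        hits += diff >= observed
    # Share of shuffles with an effect at least as large as the observed one.
    return hits / n_iter

rng = np.random.default_rng(1)
group_a = rng.normal(loc=70, scale=5, size=40)   # synthetic values
group_b = rng.normal(loc=78, scale=5, size=40)   # shifted mean: a real effect

p_effect = permutation_p_value(group_a, group_b)
p_same = permutation_p_value(group_a, group_a.copy())
print(p_effect, p_same)
```

When the two groups are identical, every shuffle matches or exceeds the observed difference of zero, so the p-value is 1; with a genuine shift, very few shuffles reach the observed difference.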

### T-Test

We will be using the t-test for independent samples. For the independent t-test, the following assumptions must be met.

-   One independent, categorical variable with two levels or groups
-   One dependent continuous variable
-   Independence of the observations. Each subject should belong to only one group. There is no relationship between the observations in each group.
-   The dependent variable must follow a normal distribution
-   Assumption of homogeneity of variance


State the hypothesis

-   $H_0: \mu_1 = \mu_2$ ("there is no difference in evaluation scores between males and females")
-   $H_1: \mu_1 \neq \mu_2$ ("there is a difference in evaluation scores between males and females")


### Levene's Test

In [None]:
scipy.stats.levene(ratings_df[ratings_df['gender'] == 'female']['eval'],
                   ratings_df[ratings_df['gender'] == 'male']['eval'], center='mean')
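
Levene's test checks the homogeneity-of-variance assumption, and its result can guide the `equal_var` argument of the t-test. The sketch below uses synthetic scores rather than `ratings_df`; the 0.05 threshold and all variable names are assumptions made for illustration.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two groups' evaluation scores.
female_eval = rng.normal(loc=3.8, scale=0.5, size=200)
male_eval = rng.normal(loc=4.2, scale=0.5, size=200)

# Levene's test: H0 is that both groups have equal variances.
levene_stat, levene_p = scipy.stats.levene(female_eval, male_eval, center="mean")

# Keep the pooled-variance t-test unless Levene's test rejects equal variances;
# otherwise fall back to Welch's t-test (equal_var=False).
equal_var = bool(levene_p >= 0.05)
t_stat, t_p = scipy.stats.ttest_ind(female_eval, male_eval, equal_var=equal_var)

print(f"Levene p = {levene_p:.3f}, equal_var = {equal_var}")
print(f"t = {t_stat:.3f}, p = {t_p:.4f}")
```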

## T-Test

### One Sample T-Test

In [None]:
t, p = scipy.stats.ttest_1samp(a=df.dose, popmean=1.166667)

In [None]:
print("t-statistic is: ", t)
print("p-value is: ", p)

### Two Samples T-Test

In [None]:
t, p = scipy.stats.ttest_ind(a=df.len, b=df.dose, equal_var=True)  # use equal_var=False (Welch's t-test) if Levene's test rejects equal variances

In [None]:
print("t-statistic is: ",t)
print("p-value is: ",p)

### ANOVA

First, we group the data into categories, because one-way ANOVA compares group means and needs a categorical grouping variable rather than a continuous one. Following the example from the video, we create a new column for the assigned group. The categories are teachers who are:

-   40 years and younger
-   between 40 and 57 years
-   57 years and older


State the hypothesis

-   $H_0: \mu_1 = \mu_2 = \mu_3$ (the three population means are equal)
-   $H_1:$ At least one of the means differs
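
The age groups listed above can be built with `pd.cut` before calling `scipy.stats.f_oneway`. This is a sketch on synthetic data: the column names `age` and `eval` mirror the ratings example, while the generated values, the bin edges at exactly 40 and 57, and the name `ratings` are assumptions.

```python
import numpy as np
import pandas as pd
import scipy.stats

rng = np.random.default_rng(0)
ratings = pd.DataFrame({
    "age": rng.integers(29, 74, size=300),             # synthetic ages, 29 to 73
    "eval": rng.normal(loc=4.0, scale=0.5, size=300),  # synthetic evaluation scores
})

# Bin the continuous age into the three categories.
# With right-closed bins, exactly 40 falls in the first group and exactly 57
# in the second; the list above leaves this boundary choice open.
ratings["age_group"] = pd.cut(
    ratings["age"],
    bins=[0, 40, 57, np.inf],
    labels=["40 years and younger", "between 40 and 57 years", "57 years and older"],
)

# One array of scores per group, in label order.
groups = [g["eval"].to_numpy() for _, g in ratings.groupby("age_group", observed=True)]

f_statistic, p_value = scipy.stats.f_oneway(*groups)
print(f"F = {f_statistic:.3f}, p = {p_value:.3f}")
```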


### One Way ANOVA

In [None]:
mod = ols('len~supp', data=df).fit()

In [None]:
aov_table = sm.stats.anova_lm(mod,typ=2)

In [None]:
aov_table

In [None]:
f_statistic, p_value = scipy.stats.f_oneway(forty_lower, forty_fiftyseven, fiftyseven_older)
print("F_Statistic: {0}, P-Value: {1}".format(f_statistic,p_value))

### Two-way ANOVA

In [None]:
mod1 = ols('len~supp+dose', data=df).fit()

In [None]:
aov1 = sm.stats.anova_lm(mod1,typ=2)

In [None]:
aov1

### Chi-square

State the hypothesis:

-   $H_0:$ The proportion of teachers who are tenured is independent of gender
-   $H_1:$ The proportion of teachers who are tenured is associated with gender

In [None]:
#Create a Cross-tab table

cont_table  = pd.crosstab(ratings_df['tenure'], ratings_df['gender'])
cont_table

In [None]:
scipy.stats.chi2_contingency(cont_table, correction = True)

In [None]:
chi_square = scipy.stats.chi2_contingency(cont_table, correction = True)

In [None]:
print("Chi-square statistic is", chi_square[0])

In [None]:
print("P-value is", chi_square[1])

In [None]:
print("Degrees of freedom is", chi_square[2])

### Correlation

State the hypothesis:

-   $H_0:$ Teaching evaluation score is not correlated with beauty score
-   $H_1:$ Teaching evaluation score is correlated with beauty score


In [None]:
pearson_correlation = scipy.stats.pearsonr(ratings_df['beauty'], ratings_df['eval'])

In [None]:
print("Pearson's correlation coefficient is", pearson_correlation[0])

In [None]:
print("P-value is", pearson_correlation[1])

## Data Preprocessing

### Feature Engineering

### Equal Width Discretization

In [None]:
df["demoscorecat"] = df["polityscore"] #Make a copy

In [None]:
disc = EqualWidthDiscretiser(bins=4, variables=['demoscorecat'], return_object=True)

In [None]:
disc

In [None]:
disc.fit(df)

In [None]:
disc.binner_dict_

In [None]:
df2 = disc.fit_transform(df)
df2.head()

In [None]:
df2["demoscorecat"].value_counts().plot.bar()
plt.show()

### Equal Frequency Discretizer

In [None]:
df2["co2cat"] = df2["co2emissions"] #Make a copy

In [None]:
disc = EqualFrequencyDiscretiser(q=5, variables=['co2cat'])

In [None]:
disc.fit(df2)

In [None]:
disc.binner_dict_

In [None]:
df3 = disc.transform(df2)
df3.head()

In [None]:
df3["co2cat"].value_counts().plot.bar()
plt.show()
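
The difference between the two discretisers can be seen with plain pandas: `pd.cut` gives equal-width bins and `pd.qcut` gives equal-frequency bins. The sketch below is an approximation for illustration, not the feature-engine implementation, and the skewed sample values are synthetic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Right-skewed synthetic values, loosely resembling an emissions-type variable.
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=200), name="value")

# Equal width: 4 bins that each span the same range; counts can be very uneven.
equal_width = pd.cut(values, bins=4, labels=False)

# Equal frequency: 5 bins that each hold the same share of observations.
equal_freq = pd.qcut(values, q=5, labels=False)

width_counts = equal_width.value_counts().sort_index()
freq_counts = equal_freq.value_counts().sort_index()
print(width_counts)
print(freq_counts)
```

On skewed data most observations land in the first equal-width bin, which is why equal-frequency binning is often preferred for variables such as `co2emissions`.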

### Discretisation + OrdinalEncoder

In [None]:
### Choose which columns to be discretized first
df3["incomecat"] = df3["incomeperperson"] #Make a copy
df3["alccat"] = df3["alcconsumption"] #Make a copy

In [None]:
df3.head()

In [None]:
# to encode variables we need them returned as objects for feature-engine
disc = EqualFrequencyDiscretiser(q=5, variables=['incomecat','alccat'], return_object=True)

In [None]:
df4 = disc.fit_transform(df3)
df4.head()

In [None]:
df4["incomecat"].value_counts().plot.bar()
plt.show()  # show each chart separately so the bars do not overlap on one axes
df4["alccat"].value_counts().plot.bar()
plt.show()

In [None]:
# Set y = target variable, and x = independent variables (both must be objects)

In [None]:
df5 = df4[['demoscorecat','incomecat', 'alccat']]
df5.head()

In [None]:
df5.dtypes

In [None]:
df5.groupby('incomecat')['demoscorecat'].mean().plot()
plt.show()

In [None]:
df5.groupby('alccat')['demoscorecat'].mean().plot()
plt.show()

In [None]:
enc = OrdinalEncoder(encoding_method = 'ordered')

In [None]:
X = df5[['incomecat', 'alccat']]

In [None]:
y = df5['demoscorecat']

In [None]:
enc.fit(X, y)

In [None]:
X_transform = enc.transform(X)

In [None]:
enc.encoder_dict_

In [None]:
X_transform  # Transformed for monotonic relationship

In [None]:
pd.concat([X_transform, y], axis=1)

In [None]:
pd.concat([X_transform, y], axis=1).groupby('incomecat')['demoscorecat'].mean().plot()
plt.show()

### Discretisation with Decision Trees

In [None]:
df4['electricat'] = df4['relectricperperson'] #Make a copy

In [None]:
df4.head()

In [None]:
# Let y = demoscorecat, and x = electricat, breastcancerper100th

df6 = df4[['breastcancerper100th','electricat','demoscorecat']]
df6.head()

In [None]:
X = df6[['breastcancerper100th','electricat']]
y = df6['demoscorecat']

In [None]:
# set up the decision tree discretiser indicating:
# cross-validation number (cv)
# how to evaluate model performance (scoring)
# the variables we want to discretise (variables)
# whether it is a target for regression or classification
# and the grid with the parameters we want to test

treeDisc = DecisionTreeDiscretiser(cv=5, scoring='accuracy', variables=['electricat'], regression=False,
                                  param_grid={'max_depth':[1,2,3], 'min_samples_leaf':[2,4,6]})

In [None]:
treeDisc.fit(X,y)

In [None]:
treeDisc.binner_dict_['electricat'].best_params_

In [None]:
treeDisc.scores_dict_['electricat']

In [None]:
X_transform = treeDisc.transform(X) #Only electricat column

In [None]:
X_transform

In [None]:
X_transform.electricat.unique()

In [None]:
# monotonic relationship with target: train set

pd.concat([X_transform, y],axis=1)

### Drop unwanted features

In [None]:
df.columns

In [None]:
df.drop()

### Treat Missing Values

In [None]:
df.isnull().sum()

In [None]:
df[''] = df[''].fillna(df[''].mean())  # fill with this column's own mean, not the mean of every column

In [None]:
#imputer = mdi.MeanMedianImputer(imputation_method='median',variables=None)

In [None]:
#imputer.fit(df)

In [None]:
#df = imputer.transform(df)

In [None]:
df.isnull().sum()
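
For a single numeric column, mean or median imputation can be done with `fillna`. The sketch below is one possible approach on a toy frame. The values are rounded from the `incomeperperson` preview, and choosing the median is an assumption for illustration, not a recommendation from the source.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "country": ["Afghanistan", "Albania", "Algeria", "Andorra", "Angola"],
    "incomeperperson": [np.nan, 1915.0, 2232.0, 21943.3, 1381.0],
})

missing_before = toy["incomeperperson"].isnull().sum()

# The median is less sensitive to extreme values (here 21943.3) than the mean.
median_income = toy["incomeperperson"].median()
toy["incomeperperson"] = toy["incomeperperson"].fillna(median_income)

missing_after = toy["incomeperperson"].isnull().sum()
print(missing_before, missing_after, median_income)
```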

### Replacing values

In [None]:
df.replace()

### Rounding Values

In [None]:
###pandas.DataFrame.round
df[['internetuserate']] = df[['internetuserate']].round(decimals=0)

### Treat Duplicate Values

In [None]:
df.duplicated(keep='first').sum()

In [None]:
df[df.duplicated(keep=False)] #Check duplicate values

In [None]:
df.drop_duplicates(ignore_index=True, inplace=True)

### Treat Outliers

In [None]:
df.columns

In [None]:
df.describe()

In [None]:
#windsorizer = Winsorizer(distribution='skewed',tail='both',fold=1.5, variables=[])

In [None]:
#windsorizer.fit(df)

In [None]:
#df2 = windsorizer.transform(df)

In [None]:
#df2

In [None]:
#df2.describe()

In [None]:
#windsorizer.left_tail_caps_

In [None]:
#windsorizer.right_tail_caps_
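
Since the feature-engine cells are commented out, the same idea can be sketched in pandas. This assumes the intended rule is capping at 1.5 times the inter-quartile range beyond the quartiles, as suggested by `fold=1.5` and the skewed setting; the function `cap_iqr` and the sample values are hypothetical.

```python
import pandas as pd

def cap_iqr(series, fold=1.5):
    """Cap values lying more than fold * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - fold * iqr, q3 + fold * iqr
    return series.clip(lower=lower, upper=upper), lower, upper

sample = pd.Series([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 95.0])  # 95 is an outlier
capped, lower_cap, upper_cap = cap_iqr(sample)
print(lower_cap, upper_cap)
print(capped.tolist())
```

Only the outlying value is pulled back to the upper cap; values inside the caps are left unchanged.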

### Type Change

In [None]:
df.info()

In [None]:
df["breastcancerper100th"] = df["breastcancerper100th"].round().astype('Int64')  # nullable integer; plain 'int' fails when the column contains NaN

In [None]:
df.info()

### One-hot encoding

In [None]:
df.info()

In [None]:
df["has_gas"] = pd.get_dummies(data=df["has_gas"],drop_first=True)
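
For a column with more than two categories, `pd.get_dummies` returns several indicator columns, which are usually joined back to the frame rather than assigned to a single column. A sketch with a hypothetical `region` column (not part of the GapMinder data):

```python
import pandas as pd

toy = pd.DataFrame({
    "country": ["Albania", "Algeria", "Angola", "Vietnam"],
    "region": ["Europe", "Africa", "Africa", "Asia"],   # hypothetical column
})

# drop_first=True drops one level ("Africa", first alphabetically) to avoid
# perfectly collinear indicator columns in a regression.
dummies = pd.get_dummies(toy["region"], prefix="region", drop_first=True, dtype=int)

encoded = pd.concat([toy.drop(columns="region"), dummies], axis=1)
print(encoded)
```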

### Save to CSV

In [None]:
df.to_csv("filename.csv", index=False)

## Regression Analysis

### Linear Regression (StatsModel)

In [None]:
df.columns

In [None]:
y = df['ExpirationMonth']
X = df['NumStores']

In [None]:
X = sm.add_constant(X)

In [None]:
model = sm.OLS(y,X).fit()

In [None]:
model.summary()

In [None]:
prediction = model.predict(X)

In [None]:
linreg = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df).fit()

### Residual Plots

In [None]:
fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_regress_exog(model, 'x_variables', fig=fig)

In [None]:
fig = plt.figure(figsize=(12,8))
fig = sm.graphics.plot_ccpr(prestige_model, "education")
fig.tight_layout(pad=1.0)

### Linear Regression (SKLearn)
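
A minimal scikit-learn sketch for this step, following the same split, fit and predict pattern as the logistic regression section. The data are synthetic; the slope of 3, the intercept of 2, the noise level and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))                    # one synthetic feature
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, size=200)   # linear signal plus noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

lin = LinearRegression()
lin.fit(X_train, y_train)

y_pred = lin.predict(X_test)
r2 = r2_score(y_test, y_pred)   # share of test-set variance explained
print(lin.coef_, lin.intercept_, r2)
```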

## Logistic Regression (StatsModel)

In [None]:
df.columns

In [None]:
y = df['']
X = df['']

In [None]:
X = sm.add_constant(X)

In [None]:
model = sm.Logit(y, X).fit()

In [None]:
model.summary()

In [None]:
logitfit = smf.logit(formula = 'DF ~ Debt_Service_Coverage + cash_security_to_curLiab + TNW', data = hgc).fit()

In [None]:
logitfit = smf.logit(formula = 'DF ~ TNW + C(seg2)', data = hgcdev).fit()

### Logistic Regression (SKLearn)

In [None]:
df.shape

In [None]:
X = df.iloc[:,:4]
y = df.iloc[:,4]

In [None]:
Counter(y)

In [None]:
X.values, y.values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size=0.2, random_state=0, stratify=y)

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
Counter(y_train), Counter(y_test)

In [None]:
lr = LogisticRegression(random_state=0)

In [None]:
lr.fit(X_train,y_train)

In [None]:
lr.coef_

In [None]:
lr.intercept_

In [None]:
y_pred = lr.predict(X_test)

In [None]:
y_pred

In [None]:
print(classification_report(y_test,y_pred))

In [None]:
cm = confusion_matrix(y_test,y_pred)
cm

In [None]:
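from sklearn.metrics import ConfusionMatrixDisplay  # plot_confusion_matrix was removed in scikit-learn 1.2
ConfusionMatrixDisplay.from_estimator(lr, X_test, y_test, cmap='YlGnBu')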
plt.show()

In [None]:
from sklearn.metrics import RocCurveDisplay  # plot_roc_curve was removed in scikit-learn 1.2
RocCurveDisplay.from_estimator(lr, X_test, y_test)
plt.show()

#### Python code done by Dennis Lam