# Statistical Analysis
Statistical Analysis is the process of collecting and analyzing data using mathematics to identify patterns and trends. 

It is important to analyze the collected data statistically before implementing any machine learning model, or even before drawing a simple conclusion from the data. Depending on how many variables are analyzed together, there are three types of statistical analysis we can conduct. These are:

- Univariate
- Bivariate
- Multivariate analysis

### Univariate Statistical Analysis
Univariate Analysis considers a single variable and summarizes the data contained in that variable alone. It does not measure the effect of one variable on the output.
Techniques for Univariate Statistical Analysis are:

- Frequency Distribution Table
- Histograms
- Mean, Median and Mode
- Ranges, Percentiles and Confidence Interval
- Pie chart and Bar chart (a Scatter plot needs two variables, so it belongs to bivariate analysis)
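
As a minimal sketch of these univariate summaries in plain pandas, the snippet below uses only the five rows printed by `df_rating.head(5)` later in this notebook, reduced to three columns. The variable name `sample` is an assumption made for illustration; on the real data the same calls would be applied to `df_rating`.

```python
import pandas as pd

# Five rows copied from the df_rating.head(5) output in this notebook
# (only three of the columns are kept for brevity).
sample = pd.DataFrame({
    "age": [36, 36, 36, 36, 59],
    "gender": ["female", "female", "female", "female", "male"],
    "eval": [4.3, 3.7, 3.6, 4.4, 4.5],
})

# Frequency distribution table of a categorical variable
gender_counts = sample["gender"].value_counts()

# Central tendency of a numerical variable
eval_mean = sample["eval"].mean()
eval_median = sample["eval"].median()
age_mode = sample["age"].mode().iloc[0]

# Range and percentiles
eval_range = sample["eval"].max() - sample["eval"].min()
eval_quartiles = sample["eval"].quantile([0.25, 0.5, 0.75])

print(gender_counts)
print(eval_mean, eval_median, age_mode, eval_range)
print(eval_quartiles)
```

Each statistic is computed on one column at a time, which is what makes the analysis univariate.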

### Bivariate Statistical Analysis

It compares or contrasts two variables to examine how they relate to each other. This type of statistical analysis can be used to measure the effect of one variable on the output (or on another variable). Bivariate Statistical techniques are:

- Correlation Coefficient
- Regression Analysis
- Crosstab
- T-Test, ANOVA test and Chi-Squared test
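
Two of these techniques, the correlation coefficient and the crosstab, can be sketched with plain pandas. The snippet again uses only the five rows shown by `df_rating.head(5)` in this notebook; the name `sample` is an assumption for illustration, and five rows are far too few for any real conclusion.

```python
import pandas as pd

# Five rows copied from the df_rating.head(5) output in this notebook.
sample = pd.DataFrame({
    "minority": ["yes", "yes", "yes", "yes", "no"],
    "gender": ["female", "female", "female", "female", "male"],
    "students": [24, 86, 76, 77, 17],
    "allstudents": [43, 125, 125, 123, 20],
})

# Pearson correlation coefficient between two numerical variables
r = sample["students"].corr(sample["allstudents"])

# Crosstab (contingency table) between two categorical variables
table = pd.crosstab(sample["gender"], sample["minority"])

print(r)
print(table)
```

Both results describe a pair of variables jointly, in contrast to the single-column summaries of univariate analysis.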

### Multivariate Statistical Analysis

Multivariate Statistical Analysis considers three or more variables together to examine their combined effects on the output. Multivariate Statistical Analysis techniques are:

- Cluster Analysis
- Variance Analysis
- Principal Component Analysis (PCA)
- Redundancy Analysis
- Binary Logistic Regression
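
To illustrate one multivariate technique, here is a rough sketch of PCA computed with NumPy's singular value decomposition rather than a dedicated library. It uses the numerical columns of the five rows shown by `df_rating.head(5)` in this notebook; the names `sample`, `standardized` and `scores` are assumptions for illustration, and the tiny sample only demonstrates the mechanics.

```python
import numpy as np
import pandas as pd

# Numerical columns of the five rows from df_rating.head(5) in this notebook.
sample = pd.DataFrame({
    "age": [36, 36, 36, 36, 59],
    "beauty": [0.289916, 0.289916, 0.289916, 0.289916, -0.737732],
    "eval": [4.3, 3.7, 3.6, 4.4, 4.5],
    "students": [24, 86, 76, 77, 17],
    "allstudents": [43, 125, 125, 123, 20],
})

# Standardize each column so variables on different scales are comparable
standardized = (sample - sample.mean()) / sample.std(ddof=0)

# PCA via singular value decomposition of the standardized data matrix
u, s, vt = np.linalg.svd(standardized.to_numpy(), full_matrices=False)

# Share of the total variance captured by each principal component
explained_variance_ratio = s**2 / np.sum(s**2)

# Coordinates of each row in principal-component space
scores = u * s

print(explained_variance_ratio)
```

All five variables enter the decomposition at once, so each component summarizes variation that several variables share.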

#### Performing statistical analysis using specific libraries

In this notebook, statistical analysis is conducted using the "pandas_profiling", "pandasgui" and "sweetviz" libraries. These libraries are useful for quickly obtaining summarized, commonly used statistical results. They provide univariate statistical results (mean, median, mode, bar plot, histogram, range, percentiles and more) as well as bivariate statistical results (several correlation coefficients, heatmap).

In [4]:
# Let's import the necessary libraries first

import numpy as np
import pandas as pd
import pandasgui
from pandas_profiling import ProfileReport

In [3]:
# First load the dataset from "teachingratings.csv"

df_rating = pd.read_csv("teachingratings.csv")
df_rating.head(3)

Unnamed: 0,minority,age,gender,credits,beauty,eval,division,native,tenure,students,allstudents,prof,PrimaryLast,vismin,female,single_credit,upper_division,English_speaker,tenured_prof
0,yes,36,female,more,0.289916,4.3,upper,yes,yes,24,43,1,0,1,1,0,1,1,1
1,yes,36,female,more,0.289916,3.7,upper,yes,yes,86,125,1,0,1,1,0,1,1,1
2,yes,36,female,more,0.289916,3.6,upper,yes,yes,76,125,1,0,1,1,0,1,1,1


In [18]:
# Now create a profile report using the pandas_profiling library
# We can view the output either as an HTML file or as widgets
# First check the HTML format

profile_rating_html = ProfileReport(df_rating,
                                    title='Profile report of teacher rating',
                                    explorative=True,
                                    html={'style': {'full_width': True}})

In [19]:
profile_rating_html




In [20]:
# The report can also be saved as an HTML file
profile_rating_html.to_file('profile_rating.html')


Here we can see a lot of details in HTML format. The report includes correlations, feature analysis, missing values, duplicate values and a lot of other useful information about the dataset.
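
A few of the items listed in the report (missing values, duplicate rows, correlations) can also be checked by hand. The sketch below shows plain pandas equivalents; it is not how the profiling library computes them internally. The frame `sample` is an assumption for illustration and holds four columns of the rows printed by `df_rating.head(5)` in this notebook.

```python
import pandas as pd

# Four columns of the five rows from df_rating.head(5) in this notebook.
sample = pd.DataFrame({
    "minority": ["yes", "yes", "yes", "yes", "no"],
    "age": [36, 36, 36, 36, 59],
    "eval": [4.3, 3.7, 3.6, 4.4, 4.5],
    "students": [24, 86, 76, 77, 17],
})

# Number of missing values in each column
missing_per_column = sample.isna().sum()

# Number of rows that exactly repeat an earlier row
duplicate_rows = int(sample.duplicated().sum())

# Pairwise Pearson correlations between the numerical columns only
correlations = sample.select_dtypes(include="number").corr()

print(missing_per_column)
print(duplicate_rows)
print(correlations)
```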

In [17]:
# Now let's try the widget mode

profile_rating_widget = ProfileReport(df_rating,
                                      title='Profile report of teacher rating',
                                      explorative=True)

profile_rating_widget.to_widgets()


##### An exploratory analysis with pandasgui

In [27]:
# First import the library
from pandasgui import show

In [28]:
# The dataset is already loaded in the "df_rating" variable. We can apply the "show" function from pandasgui to it
show(df_rating)

<pandasgui.gui.PandasGui at 0x153911cbf70>

It opens a new window with several options for exploring the dataset. The new window looks like the following image:

![Capture.PNG](attachment:Capture.PNG)

- In the DataFrame section, the whole dataset is shown.

- In the Filter section, we can filter the dataset by applying conditions on different features.

- In the Statistic section, the feature details are shown (missing values, datatype, count, min, max, mean, etc.).

- In the Graphic section, different types of plots can be created.

- In the Reshape section, the dataset can be reshaped in different ways.
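
For readers without a GUI session, the Filter, Statistic and Reshape sections can be roughly approximated in plain pandas. This is only an analogy under stated assumptions, not what pandasgui runs internally; the frame `sample` and the chosen condition and pivot are hypothetical, built from the rows printed by `df_rating.head(5)` in this notebook.

```python
import pandas as pd

# Three columns of the five rows from df_rating.head(5) in this notebook.
sample = pd.DataFrame({
    "age": [36, 36, 36, 36, 59],
    "gender": ["female", "female", "female", "female", "male"],
    "eval": [4.3, 3.7, 3.6, 4.4, 4.5],
})

# Filter: keep only the rows that satisfy a condition
older = sample.query("age > 40")

# Statistic: count, mean, min, max, etc. for the numerical columns
stats = sample.describe()

# Reshape: mean evaluation score per gender as a pivot table
pivot = sample.pivot_table(index="gender", values="eval", aggfunc="mean")

print(older)
print(stats)
print(pivot)
```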

In [31]:
df_rating.head(5)

Unnamed: 0,minority,age,gender,credits,beauty,eval,division,native,tenure,students,allstudents,prof,PrimaryLast,vismin,female,single_credit,upper_division,English_speaker,tenured_prof
0,yes,36,female,more,0.289916,4.3,upper,yes,yes,24,43,1,0,1,1,0,1,1,1
1,yes,36,female,more,0.289916,3.7,upper,yes,yes,86,125,1,0,1,1,0,1,1,1
2,yes,36,female,more,0.289916,3.6,upper,yes,yes,76,125,1,0,1,1,0,1,1,1
3,yes,36,female,more,0.289916,4.4,upper,yes,yes,77,123,1,1,1,1,0,1,1,1
4,no,59,male,more,-0.737732,4.5,upper,yes,yes,17,20,2,0,0,0,0,1,1,1


###### Using sweetviz library

In [32]:
# Importing the library first
import sweetviz as sv

In [51]:
# Create the analysis report, save it as an HTML file and show it in the browser
report_sweetviz = sv.analyze(df_rating)
report_sweetviz.show_html("report_sweetviz.html", open_browser = True)


Report report_sweetviz.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


The statistical details are more elaborate for numerical data than for categorical data.

In [None]:
# We can also compare two different datasets, for example the train and test datasets
# The column "eval" is the target variable

x = df_rating.drop('eval', axis = 1)
y = df_rating['eval']

In [47]:
# Now split the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 41)

In [49]:
# Let's save the html file and see the compared result in the browser
report_compared = sv.compare([x_train, 'x_train'], [x_test, 'x_test'])
report_compared.show_html('report_compare.html', open_browser = True)


Report report_compare.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.
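
The idea behind such a comparison can be sketched without the report: split the data, then place a summary statistic of each part side by side. The snippet below uses `DataFrame.sample` as a stand-in for `train_test_split` so that it needs only pandas. The names `sample`, `train`, `test` and `comparison` are assumptions for illustration, and the data are the rows printed by `df_rating.head(5)` in this notebook.

```python
import pandas as pd

# Three numerical columns of the five rows from df_rating.head(5).
sample = pd.DataFrame({
    "age": [36, 36, 36, 36, 59],
    "eval": [4.3, 3.7, 3.6, 4.4, 4.5],
    "students": [24, 86, 76, 77, 17],
})

# Stand-in for train_test_split: draw 80% of the rows at random for training
# and keep the remaining rows for testing.
train = sample.sample(frac=0.8, random_state=41)
test = sample.drop(train.index)

# Column means of both parts, side by side
comparison = pd.concat([train.mean(), test.mean()], axis=1, keys=["train", "test"])

print(comparison)
```

Large gaps between the two columns would suggest that the split is not representative, which is the kind of difference the comparison report is meant to surface.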
