 # Tutorial 2:   Data Exploration in Python


Data exploration is one of the key steps in data analysis. It involves gaining a preliminary understanding of the data, such as its structure, the statistical distribution of its features, and the interrelationships among features. There are two main motivations for data exploration:

1. It helps in selecting appropriate data preprocessing techniques for the data set.
2. It helps in selecting a suitable machine learning technique.

The Python library Pandas is a powerful tool for gaining an understanding of data. It provides many data manipulation and exploration methods that can be used for basic data analysis, and when combined with visualization tools it can give a deeper understanding of the data. Data exploration in this tutorial uses the following two main Python libraries:

$\textbf{1. Pandas}$  

$\textbf{2. Matplotlib/Seaborn}$ 

Data exploration is divided into three main steps:

$\textbf{1. Basic data details }$ 

$\textbf{2. Summary Statistics}$

$\textbf{3. Data Visualization}$




## 1.1 Basic Data Details

Using Pandas, data sets in different formats can be loaded into the Python environment for analysis. Pandas reads a tabular data set into a DataFrame object. The three common functions used to read a data set are as follows:

1. read_csv() : uses a comma as the default delimiter

2. read_table() : uses '\t' (tab) as the default delimiter

3. read_excel() : reads an Excel file
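
The difference between the default delimiters can be sketched on small in-memory strings. The column names and values below are made up for illustration and are not taken from the tutorial's data set; `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical in-memory text standing in for files on disk.
csv_text = "item,weight\nA,9.3\nB,5.92\n"
tsv_text = "item\tweight\nA\t9.3\nB\t5.92\n"
semicolon_text = "item;weight\nA;9.3\nB;5.92\n"

# read_csv splits each line on commas by default.
from_csv = pd.read_csv(io.StringIO(csv_text))

# read_table splits each line on tab characters by default.
from_tsv = pd.read_table(io.StringIO(tsv_text))

# Any other delimiter can be passed explicitly through `sep`.
from_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";")

# All three calls should yield the same two-column table.
print(from_csv.equals(from_tsv))
print(from_csv.equals(from_semicolon))
```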

$\color{red}{Code:}$ The following code uses Pandas to read a CSV file (here located in the current working directory) and stores it in a DataFrame object named data. 

In [3]:
import pandas as pd

data = pd.read_csv('Grocery_dataset.csv')

$\color{red}{Code}$: The size of the data set loaded into the Pandas DataFrame can be printed using the following code

In [None]:
print("The number of rows in data set:", len(data))
print("The number of columns in data set:", len(data.columns))
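
As a possible alternative, the `shape` attribute of a DataFrame returns the row and column counts together as a tuple. The sketch below uses a small made-up frame named `toy` (an assumption for illustration); the same attribute would apply to `data`.

```python
import pandas as pd

# Hypothetical toy frame used only to illustrate the attribute.
toy = pd.DataFrame({"item": ["A", "B", "C"], "weight": [9.3, 5.92, 17.5]})

# shape is a (number of rows, number of columns) tuple.
n_rows, n_cols = toy.shape
print("Rows:", n_rows, "Columns:", n_cols)

# head(n) shows the first n rows for a quick look at the values.
print(toy.head(2))
```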


$\color{red}{Code}$: The data types of the features can be printed using the following code

In [2]:
print(data.dtypes)

Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
dtype: object


$\color{red}{Code}$: The following code changes the data type of a given feature to a different dtype, for example int to float, float to int, or float to object.


In [None]:
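# int -> float keeps the values (e.g. 1999 becomes 1999.0)
data.Outlet_Establishment_Year = data.Outlet_Establishment_Year.astype(float)
# float -> int truncates the fractional part (illustration only; fails on missing values)
data.Item_Visibility = data.Item_Visibility.astype(int)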
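
The effect of each conversion can be checked on a few values before applying it to a real column. The following is a minimal sketch on a made-up frame; the column names `year`, `visibility` and `fat` are hypothetical stand-ins. Note that `astype` returns a new Series and leaves the original untouched unless the result is assigned back.

```python
import pandas as pd

# Hypothetical toy data used only to illustrate dtype conversion.
toy = pd.DataFrame({
    "year": [1999, 2004, 2009],
    "visibility": [0.016, 0.75, 1.9],
    "fat": ["Low Fat", "Regular", "Low Fat"],
})

# int -> float: no information is lost.
year_as_float = toy["year"].astype(float)

# float -> int: the fractional part is dropped (truncation, not rounding).
vis_as_int = toy["visibility"].astype(int)

# text -> category: repeated labels are stored as a fixed set of categories.
fat_as_cat = toy["fat"].astype("category")

print(year_as_float.tolist())
print(vis_as_int.tolist())
print(fat_as_cat.cat.categories.tolist())
```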

## 1.2 Summary Statistics

$\color{red}{Code}$: The following code computes basic statistics of the numerical features of the data set, such as the mean, standard deviation, minimum and maximum value. For each qualitative attribute in the data set, it counts the frequency of each distinct value. 


In [None]:
from pandas.api.types import is_numeric_dtype, is_object_dtype

for col in data.columns:
    if is_numeric_dtype(data[col]):
        # Quantitative feature: print basic summary statistics
        print('%s:' % col)
        print('\t Mean = %.2f' % data[col].mean())
        print('\t Standard deviation = %.2f' % data[col].std())
        print('\t Minimum = %.2f' % data[col].min())
        print('\t Maximum = %.2f' % data[col].max())
    elif is_object_dtype(data[col]):
        # Qualitative feature: print the frequency of each distinct value
        print(data[col].value_counts())
    
    

$\color{red}{Code}$: We can also display the summary for all the attributes simultaneously in a table using the describe() method. If an attribute is quantitative, it displays its mean, standard deviation and various quantiles (including minimum, median, and maximum). If an attribute is qualitative, it displays the number of unique values and the top (most frequent) value together with its frequency.

In [None]:
data.describe(include='all')
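
The layout of the resulting table can be seen on a small example. The sketch below uses a made-up frame with one quantitative and one qualitative column (hypothetical names `weight` and `fat`); statistics that do not apply to a column are filled with NaN.

```python
import pandas as pd

# Hypothetical toy data: one quantitative and one qualitative column.
toy = pd.DataFrame({
    "weight": [1.0, 2.0, 3.0, 4.0],
    "fat": ["Low Fat", "Regular", "Low Fat", "Low Fat"],
})

summary = toy.describe(include="all")
print(summary)

# Individual statistics can be read from the table by row and column label.
print(summary.loc["mean", "weight"])   # mean of the quantitative column
print(summary.loc["50%", "weight"])    # median of the quantitative column
print(summary.loc["top", "fat"])       # most frequent label
print(summary.loc["freq", "fat"])      # how often that label occurs
```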

$\color{red}{Code}$: The covariance and correlation among numerical features can be printed using the following code

 

In [None]:
print('Correlation:')
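# Restrict to numeric columns; recent Pandas versions raise an error on text columns
data.select_dtypes(include='number').corr()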

In [None]:
print('Covariance:')
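# Restrict to numeric columns; recent Pandas versions raise an error on text columns
data.select_dtypes(include='number').cov()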
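
The two quantities are related: the Pearson correlation is the covariance rescaled by the standard deviations, $\mathrm{corr}(X, Y) = \mathrm{cov}(X, Y) / (\sigma_X \sigma_Y)$, which is why it always lies between -1 and 1. A minimal sketch checking this relation on made-up data (the columns `weight` and `price` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with two numeric columns.
toy = pd.DataFrame({
    "weight": [1.0, 2.0, 3.0, 4.0],
    "price": [2.0, 4.1, 5.9, 8.2],
})

cov = toy.cov()   # sample covariance matrix
std = toy.std()   # sample standard deviation of each column

# Divide every covariance entry by the product of the two standard deviations.
corr_from_cov = cov / np.outer(std, std)

# The result should match the matrix returned by corr().
print(np.allclose(corr_from_cov, toy.corr()))
```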

## 1.3 Data Visualization

$\color{red}{Code} :$ The distribution of a feature can be visualized with a histogram. For numerical features, the values are binned and plotted using the hist() method of the Pandas DataFrame, whereas for categorical features a bar plot is drawn using value_counts() and plot.bar(). 


$\color{red}{Code} :$ The following code demonstrates plotting the histogram of a given feature

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
data['Item_Weight'].hist(bins=8)
plt.title('Distribution of Feature')
plt.xlabel('Range of bins')
plt.ylabel('# of data points falling in bins')
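
To see what the bars of a histogram represent, the counts can be computed without plotting. The sketch below uses `numpy.histogram` on made-up values as an illustration of equal-width binning between the minimum and maximum; it is not a claim about every detail of how hist() works internally.

```python
import numpy as np
import pandas as pd

# Hypothetical values, including one missing entry.
weights = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, None])

# Missing values are dropped first, since NaN cannot be assigned to a bin.
counts, edges = np.histogram(weights.dropna(), bins=4)

# Each bin is described by its left edge, right edge and number of values.
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.2f} to {right:.2f}: {n} values")
```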

$\color{red}{Code} :$ The following code demonstrates plotting the bar plot of a given feature

In [None]:
data['Item_Fat_Content'].value_counts().plot.bar()
plt.title('Frequency distribution of Item_Fat_Content')
plt.xlabel('Category')
plt.ylabel('# of data points in each category')

$\color{red}{Code}$:  A boxplot can also be used to show the distribution of values for each attribute. The code below shows the boxplot of two selected features



In [None]:
data.boxplot(column=['Item_Weight','Item_MRP'])
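
With the default settings, the box spans the first to the third quartile, and points lying more than 1.5 times the interquartile range (IQR) beyond the box are drawn individually as outliers. A minimal sketch of that rule on made-up values:

```python
import pandas as pd

# Hypothetical values with one clearly extreme point.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

q1 = values.quantile(0.25)   # lower edge of the box
q3 = values.quantile(0.75)   # upper edge of the box
iqr = q3 - q1                # height of the box

# Points outside these bounds would be drawn as individual outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print("Q1:", q1, "Q3:", q3, "IQR:", iqr)
print("Outliers:", outliers.tolist())
```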

$\color{red}{Code}$: For each pair of attributes, we can use a scatter plot to visualize their joint distribution.

In [None]:
pd.plotting.scatter_matrix(data, alpha=0.2, figsize=(10, 10))
plt.show()
    

$\color{red}{Code :}$ The following code plots a correlation matrix plot

In [None]:
plt.matshow(data.select_dtypes(include='number').corr())
plt.colorbar()
plt.show()
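
A matrix plot is easier to read when the rows and columns carry the feature names and the color scale is fixed to the range of a correlation, from -1 to 1. The following is a sketch on a made-up frame (all column names are hypothetical); the `Agg` backend line is only an assumption to let the example run outside a notebook and should be dropped when working inline.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data with three numeric columns and one text column.
toy = pd.DataFrame({
    "weight": [1.0, 2.0, 3.0, 4.0],
    "price": [2.0, 4.1, 5.9, 8.2],
    "year": [2009, 2004, 1999, 1987],
    "fat": ["Low Fat", "Regular", "Low Fat", "Regular"],
})

# Correlation is only defined for the numeric columns.
corr = toy.select_dtypes(include="number").corr()
labels = list(corr.columns)

fig, ax = plt.subplots()
image = ax.matshow(corr.to_numpy(), vmin=-1, vmax=1)  # fixed color scale
fig.colorbar(image)

# Put one tick per feature and label it with the feature name.
ax.set_xticks(range(len(labels)))
ax.set_xticklabels(labels, rotation=90)
ax.set_yticks(range(len(labels)))
ax.set_yticklabels(labels)
```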

# 2.  Exercise

1. Download the iris data set from https://archive.ics.uci.edu/ml/datasets/Iris . Read the data set in Python. Analyse all features statistically and interpret the correlation results.

2. Download the Housing data set from https://archive.ics.uci.edu/ml/machine-learning-databases/housing/ . Read the data set in Python. Analyse all features statistically and interpret the correlation results.