<a href="https://colab.research.google.com/github/adong-hood/dm-24/blob/main/module_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Module 3: Data Exploration

The following tutorial contains examples of Python code for data exploration. You should refer to the "Data Exploration" chapter of the "Introduction to Data Mining" book (available at https://www-users.cs.umn.edu/~kumar001/dmbook/index.php) to understand some of the concepts introduced in this tutorial notebook. The notebook can be downloaded from http://www.cse.msu.edu/~ptan/dmbook/tutorials/tutorial3/tutorial3.ipynb.

Data exploration refers to the preliminary investigation of data in order
to better understand its specific characteristics. There are two key motivations for data exploration:
1. To help users select appropriate preprocessing and data analysis techniques.
2. To make use of humans’ abilities to recognize patterns in the data.

Read the step-by-step instructions below carefully. To execute the code, click on the cell and press the SHIFT-ENTER keys simultaneously.

## 3.1. Summary Statistics

Summary statistics are quantities, such as the mean and standard deviation, that capture various characteristics of a potentially large set of values with a single number or a small set of numbers. In this tutorial, we will use the Iris sample data, which contains information on 150 Iris flowers, 50 from each of three Iris species: Setosa, Versicolour, and Virginica. Each flower is characterized by five attributes:

- sepal length in centimeters

- sepal width in centimeters

- petal length in centimeters

- petal width in centimeters

- class (Setosa, Versicolour, Virginica)

In this tutorial, you will learn how to:

- Load a CSV data file into a Pandas DataFrame object.

- Compute various summary statistics from the DataFrame.

To execute the sample program shown here, make sure you have installed the Pandas library (see Module 2).

**1.** First, you need to download the <a href="http://archive.ics.uci.edu/ml/datasets/Iris">Iris dataset</a> from the UCI machine learning repository.

**<font color='red'>Code:</font>** The following code uses Pandas to read the CSV file and store its contents in a DataFrame object named data. It then displays the first five rows of the DataFrame.

In [None]:
import pandas as pd

data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',header=None)
data.columns = ['sepal length', 'sepal width', 'petal length', 'petal width', 'class']

data.head()

**2.** For each quantitative attribute, calculate its average, standard deviation, minimum, and maximum values.

**<font color="red">Code:</font>**

In [None]:
from pandas.api.types import is_numeric_dtype

for col in data.columns:
    if is_numeric_dtype(data[col]):
        print('%s:' % (col))
        print('\t Mean = %.2f' % data[col].mean())
        print('\t Standard deviation = %.2f' % data[col].std())
        print('\t Minimum = %.2f' % data[col].min())
        print('\t Maximum = %.2f' % data[col].max())

**3.** For the qualitative attribute (class), count the frequency for each of its distinct values.

**<font color="red">Code:</font>**

In [None]:
data['class'].value_counts()

**4.** It is also possible to display the summary for all the attributes simultaneously in a table using the describe() function. If an attribute is quantitative, it will display its mean, standard deviation, and various quantiles (including the minimum, median, and maximum). If an attribute is qualitative, it will display the number of unique values, the top (most frequent) value, and the frequency of that value.

**<font color="red">Code:</font>**

In [None]:
data.describe(include='all')

Note that count refers to the number of non-missing values for each attribute.
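To see what "non-missing" means in practice, here is a minimal sketch. It uses a hypothetical toy column (named `toy` here, not part of the Iris data) that contains one missing value, and compares the number of rows with the count reported by describe().

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value (not taken from the Iris data)
toy = pd.DataFrame({'petal length': [1.4, np.nan, 4.7, 5.1]})

summary = toy.describe()

# Total number of rows, including the row with the missing value
n_rows = len(toy)
# describe() counts only the non-missing entries of the column
n_non_missing = summary.loc['count', 'petal length']

print('Rows in the DataFrame:', n_rows)
print('Count reported by describe():', n_non_missing)
```

If the two numbers differ for an attribute, that attribute has missing values that may need attention during preprocessing.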

**5.** For multivariate statistics, you can compute the covariance and correlation between pairs of attributes.

**<font color="red">Code:</font>**

In [None]:
print('Covariance:')
numeric_data = data[['sepal length', 'sepal width', 'petal length', 'petal width']]
numeric_data.cov()

In [None]:
print('Correlation:')
numeric_data = data[['sepal length', 'sepal width', 'petal length', 'petal width']]
numeric_data.corr()
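The two tables are closely related: the (Pearson) correlation of two attributes is their covariance divided by the product of their standard deviations. The sketch below checks this on a small hypothetical DataFrame (the columns `x` and `y` are made up for illustration and are not part of the Iris data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the relationship
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'y': [2.1, 3.9, 6.2, 7.8, 10.1]})

# Covariance between x and y, read from the covariance matrix
cov_xy = toy.cov().loc['x', 'y']

# Correlation computed by hand: covariance scaled by both standard deviations
corr_manual = cov_xy / (toy['x'].std() * toy['y'].std())

# Correlation computed by pandas
corr_pandas = toy.corr().loc['x', 'y']

print('Manual correlation: %.4f' % corr_manual)
print('Pandas correlation: %.4f' % corr_pandas)
```

Because of this scaling, correlation always lies between -1 and 1, which makes it easier to compare across attribute pairs than raw covariance.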

## 3.2. Data Visualization

Data visualization is the display of information in a graphic or tabular format. Successful visualization requires that the data (information) be converted into a visual format so that the characteristics of the data and the relationships
among data items or attributes can be analyzed or reported.

In this tutorial, you will learn how to display the Iris data created in Section 3.1. To execute the sample program shown here, make sure you have installed the matplotlib library package (see Module 0 on how to install Python packages).

**1.** First, we will display the histogram for the sepal length attribute by discretizing it into 8 separate bins and counting the frequency for each bin.

**<font color="red">Code:</font>**

In [None]:
%matplotlib inline

data['sepal length'].hist(bins=8)
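To make the discretization step concrete, the following sketch computes equal-width bin edges and per-bin counts numerically with NumPy instead of plotting them. The values are hypothetical and merely resemble sepal lengths; they are not taken from the Iris file.

```python
import numpy as np
import pandas as pd

# Hypothetical values resembling sepal lengths (not read from the Iris data)
values = pd.Series([4.3, 4.9, 5.0, 5.1, 5.8, 6.0, 6.4, 6.9, 7.7, 7.9])

# Split the range [min, max] into 8 equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=8)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print('[%.2f, %.2f): %d' % (left, right, count))
```

Eight bins are delimited by nine edges, and the counts add up to the number of values. Note that the last bin also includes its right edge, so the maximum value is counted.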

**2.** A boxplot can also be used to show the distribution of values for each attribute.

**<font color="red">Code:</font>**

In [None]:
data.boxplot()
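The quantities a boxplot draws can also be computed directly. The sketch below uses hypothetical values and assumes the common convention (the matplotlib default) that points lying more than 1.5 times the interquartile range beyond the quartiles are drawn as outliers.

```python
import pandas as pd

# Hypothetical values resembling sepal widths (not read from the Iris data)
values = pd.Series([2.0, 2.9, 3.0, 3.0, 3.1, 3.2, 3.4, 4.4])

# The box spans the first to the third quartile; the line inside is the median
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are flagged as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print('Q1 = %.3f, median = %.3f, Q3 = %.3f' % (q1, median, q3))
print('Outliers:', list(outliers))
```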

**3.** For each pair of attributes, we can use a scatter plot to visualize their joint distribution.

**<font color="red">Code:</font>**

In [None]:
import matplotlib.pyplot as plt

# One subplot for each of the 6 attribute pairs, arranged in a 3x2 grid
fig, axes = plt.subplots(3, 2, figsize=(12,12))
index = 0
for i in range(3):
    for j in range(i+1,4):
        # Map the running pair index to a (row, column) position in the grid
        ax1 = index // 2
        ax2 = index % 2
        axes[ax1][ax2].scatter(data[data.columns[i]], data[data.columns[j]], color='red')
        axes[ax1][ax2].set_xlabel(data.columns[i])
        axes[ax1][ax2].set_ylabel(data.columns[j])
        index = index + 1

In [None]:
from pandas.plotting import scatter_matrix

# Create pairwise scatterplots
scatter_matrix(data, alpha=0.8, figsize=(12, 12), diagonal='kde', color='red')
plt.show()

**4.** Parallel coordinates can be used to display all the data points simultaneously. Parallel coordinates have one coordinate axis for each attribute, but the different axes are parallel to one another instead of perpendicular, as is traditional. Furthermore, an object is represented as a line instead of as a point. In the example below, each class is drawn in a separate color so that its distribution of values can be identified.

**<font color="red">Code:</font>**

In [None]:
from pandas.plotting import parallel_coordinates
%matplotlib inline

parallel_coordinates(data, 'class')

## 3.3. Summary

This tutorial presents several examples for data exploration and visualization using the Pandas and matplotlib library packages available in Python.

**<font color='blue'>References:</font>**

1. Documentation on Pandas. https://pandas.pydata.org/
2. Documentation on matplotlib. https://matplotlib.org/
3. Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

## More Examples

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np

In [None]:
happiness_df = pd.read_csv('http://pluto.hood.edu/~dong/datasets/happiness_2017.csv')
happiness_df.head()

In [None]:
print(happiness_df.shape)
happiness_df.columns

In [None]:
life_ladder_df = happiness_df[['Life Ladder','Generosity']]
print(life_ladder_df['Life Ladder'].min())
print(life_ladder_df.shape)
life_ladder_df.head(2)

In [None]:
# selecting multiple columns by names.
df_1 = happiness_df.loc[:, 'Life Ladder':'Generosity']
df_1.head()

In [None]:
# slicing
df_2 = happiness_df.iloc[10:100, 5:10]
df_2.head()


In [None]:
happiness_df['Region'].unique()

In [None]:
western_europe_df = happiness_df[happiness_df['Region'] == "Western Europe"]
print(western_europe_df.shape)
western_europe_df.head(2)

In [None]:
numeric_data_df = happiness_df.select_dtypes(include=['number'])
correlation_matrix = numeric_data_df.corr()
correlation_matrix.style.background_gradient(cmap='coolwarm')

In [None]:
correlation_matrix = happiness_df.select_dtypes(include=['number']).corr()
# Mask the self-correlations (equal to 1), then flatten to a Series indexed by attribute pairs
correlation_matrix = correlation_matrix[correlation_matrix < 1].stack()
print(correlation_matrix)
# idxmax returns the (row, column) pair with the largest remaining correlation
correlation_matrix_pos = correlation_matrix.idxmax()
print(correlation_matrix_pos)
max_corr_value = correlation_matrix[correlation_matrix_pos]
print(max_corr_value)

## Homework 2 ##

**Please do not look for answers manually, even if you can. <font color="red">Your Homework 2 submission should only include content from this point on. Submit a PDF, not an ipynb file.</font>**

### Q-1: Calculate the average, standard deviation, maximum, minimum, and median of the happiness scores.
Your solution should show only these statistics for the happiness scores.

### Q-2: What is the name and happiness score of the country with the lowest confidence in their national government?

### Q-3: How many countries are in Western Europe?
This is very easy with a grouping function, but you can also do it without one.
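As a reminder of what a grouping function does, here is a minimal sketch on a hypothetical toy table (the column names `Item` and `Category` are made up and unrelated to the happiness data). It counts the rows in each group with groupby().

```python
import pandas as pd

# Hypothetical toy table, unrelated to the homework data
toy = pd.DataFrame({
    'Item': ['apple', 'carrot', 'banana', 'rice', 'spinach'],
    'Category': ['fruit', 'vegetable', 'fruit', 'grain', 'vegetable'],
})

# Group the rows by category and count the items in each group
counts = toy.groupby('Category')['Item'].count()
print(counts)
```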

### Q-4: Which two factors have the largest positive correlation, and which two factors have the largest negative correlation?


## Merging data
Let's load the world population data.

In [None]:
world_pop_df = pd.read_csv('http://pluto.hood.edu/~dong/datasets//world_countries.csv').dropna(axis=1, how='all')


To extract populations from world_pop_df, we have to merge happiness_df with world_pop_df. Please note that some of the country names in <code>world_countries.csv</code> and <code>happiness_2017.csv</code> do not match (see the [country mismatch file](http://pluto.hood.edu/~dong/datasets/country_mismatch_missing.txt) for your convenience).

There are 4 kinds of merge: 'inner', 'outer', 'left', and 'right'. We practiced inner merge previously.  

You can find examples at https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html (see the section "Example: US States Data").
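The four kinds of merge differ only in which keys they keep. The following sketch illustrates this on two small hypothetical tables (`left` and `right`, with made-up country codes and values). It also uses the `indicator=True` option of pd.merge, which adds a `_merge` column and can help spot keys that appear in only one table, such as mismatched country names.

```python
import pandas as pd

# Hypothetical toy tables; 'A' appears only on the left, 'D' only on the right
left = pd.DataFrame({'country': ['A', 'B', 'C'], 'score': [7.1, 5.4, 6.0]})
right = pd.DataFrame({'country': ['B', 'C', 'D'], 'population': [10, 20, 30]})

# Compare the number of rows produced by each kind of merge
row_counts = {}
for how in ['inner', 'outer', 'left', 'right']:
    merged = pd.merge(left, right, on='country', how=how)
    row_counts[how] = len(merged)
    print('%s merge: %d rows' % (how, len(merged)))

# An outer merge with indicator=True labels each row as
# 'left_only', 'right_only', or 'both'
flagged = pd.merge(left, right, on='country', how='outer', indicator=True)
unmatched = flagged[flagged['_merge'] != 'both']
print(unmatched)
```

An inner merge keeps only the keys found in both tables, an outer merge keeps every key, and left and right merges keep all keys from one side. Rows without a match are filled with missing values.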

### Q-5: Which country has the largest population in Middle East and North Africa?

### Q-6: Find the average population of Latin America and Caribbean.

### Q-7: Problem Statement
You have a dataset containing information about customers and whether they purchased a product or not. The goal is to determine the best attribute to split the data based on the Gini index.

Dataset
<pre>
Customer ID	Age	Income	Purchased
1	22	High	No
2	35	Medium	Yes
3	45	High	Yes
4	25	Low	No
5	30	High	Yes
6	40	Low	No
7	50	Medium	Yes
8	28	Medium	No
</pre>

Calculate the Gini index for each attribute (Age and Income) and determine which attribute should be chosen for the first split in the decision tree.

For age, split the dataset into two groups based on age: younger than 30 and 30 or older. For income, split the dataset into two groups based on income: High and Medium/Low.

Show your work for full credit.
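As a reminder of the definitions, the Gini index of a node is one minus the sum of the squared class proportions, and the Gini index of a split is the average of the child nodes' Gini indices weighted by their sizes. The sketch below applies these definitions to a hypothetical toy table that is deliberately different from the homework dataset; the helper names `gini` and `weighted_gini` and the columns `Owns Car` and `Clicked` are made up for illustration.

```python
import pandas as pd

def gini(labels):
    """Gini index of one node: 1 minus the sum of squared class proportions."""
    proportions = pd.Series(labels).value_counts(normalize=True)
    return 1.0 - (proportions ** 2).sum()

def weighted_gini(df, split_col, label_col):
    """Gini index of a split: size-weighted average over the child nodes."""
    n = len(df)
    return sum(len(group) / n * gini(group[label_col])
               for _, group in df.groupby(split_col))

# Hypothetical toy data, not the homework dataset
toy = pd.DataFrame({
    'Owns Car': ['Yes', 'Yes', 'Yes', 'No', 'No', 'No'],
    'Clicked':  ['Yes', 'Yes', 'No', 'No', 'No', 'No'],
})

parent_gini = gini(toy['Clicked'])
split_gini = weighted_gini(toy, 'Owns Car', 'Clicked')

print('Gini before the split: %.4f' % parent_gini)
print('Gini after splitting on Owns Car: %.4f' % split_gini)
```

When comparing candidate attributes, the one whose split yields the lowest weighted Gini index is preferred.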

In [None]:
#do not include the output from installation.
!apt-get install texlive texlive-xetex texlive-latex-extra pandoc
!pip install pypandoc
!pip install nbconvert

In [None]:
!jupyter nbconvert  '/content/drive/MyDrive/datamining/module-3.ipynb' --to pdf
