## Exploratory Data Analysis with Python (`part 1`)

<br>

Machine Learning (ML) starts with `data`, and the first task in any ML process is data exploration.

***
Before approaching any ML problem one must spend a significant amount of time to

>- explore
>- analyze
>- visualize

the data.

<div class="alert alert-block alert-info">
    <b>Note:</b><br>
        A machine cannot learn without data.<br>
        Data exploration is the most important step before applying any ML algorithm.
</div>

Let us create some data. Run the code in the cell below by clicking the **&#9658; Run** button to see the results.

In [1]:
data = [65,81,97,77,91]
print(data)

[65, 81, 97, 77, 91]


In [2]:
len(data)

5

Treat these values as student grades.

In [3]:
import numpy as np

grades = np.array(data)
print(grades)

[65 81 97 77 91]


Just in case you're wondering about the differences between a list and a NumPy array, let's compare how these data types behave when we use them in an expression that multiplies them by 2.

In [4]:
print(type(data), 'x 2:', data * 2)
print('---')
print(type(grades), 'x 2:', grades * 2)

<class 'list'> x 2: [65, 81, 97, 77, 91, 65, 81, 97, 77, 91]
---
<class 'numpy.ndarray'> x 2: [130 162 194 154 182]


<div class="alert alert-block alert-info">
    <b>Note:</b><br>
    The key takeaway is that NumPy arrays are specifically designed to support mathematical operations on numeric data, which makes them more useful for data analysis than a generic list.
</div>
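
As a small illustration of that point, the sketch below applies a few element-wise operations to the same grades. The names `curved`, `fractions`, and `above_80`, and the 3-point bonus, are made up for this example and are not part of the original data.

```python
import numpy as np

grades = np.array([65, 81, 97, 77, 91])

# Arithmetic applies to every element at once (no explicit loop needed)
curved = grades + 3        # hypothetical 3-point bonus, for illustration only
fractions = grades / 100   # each grade as a fraction of 100

# Comparisons are element-wise too and produce a boolean array
above_80 = grades > 80

print(curved)              # [ 68  84 100  80  94]
print(fractions)
print(above_80)
print(grades[above_80])    # keep only the grades where the mask is True: [81 97 91]
```

A plain list would need a loop or a list comprehension for each of these steps.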


The class type for the NumPy array above is `numpy.ndarray`. The `nd` indicates that this is a structure that can consist of multiple dimensions (an `n-dimensional array`). Our specific instance has a single dimension of student grades.

In [5]:
grades.shape

(5,)

In [6]:
print(grades)
grades[0]

[65 81 97 77 91]


65

In [7]:
grades.mean()

82.2

Let's add a second set of data for the same students, this time recording the typical `number of hours per week they devoted to studying`.

In [8]:
# Define an array of study hours
study_hours = [6,8,11,6,9]

Combining the study hours with the grades gives a 2-dimensional array (an array of arrays). Let's create it and look at its shape.

In [9]:
# Create a 2D array (an array of arrays)
student_data = np.array([study_hours, grades])

# display the array
student_data

array([[ 6,  8, 11,  6,  9],
       [65, 81, 97, 77, 91]])

In [10]:
# Show shape of 2D array
student_data.shape

(2, 5)

#### `student_data` in tabular form
***
|study_hours|grades|
|-----|------| 
|6|65| 
|8|81|
|11|97| 
|6|77|
|9|91|
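
The table above lists one student per row, whereas `student_data` stores one variable per row. As a minimal sketch (the name `table_view` is introduced here only for illustration), transposing the array with `.T` should give the same layout as the table:

```python
import numpy as np

study_hours = [6, 8, 11, 6, 9]
grades = np.array([65, 81, 97, 77, 91])
student_data = np.array([study_hours, grades])

# Transpose: rows become columns, so each row now describes one student
table_view = student_data.T

print(table_view.shape)   # (5, 2)
print(table_view)
```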



The **student_data** array contains two elements, each of which is an array containing 5 elements.

To navigate this structure, you need to specify the position of each element in the hierarchy. So to find the first value in the first array (which contains the study hours data), you can use the following code.

In [11]:
# Show the first element of the first element
student_data[0][0]

6
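
NumPy also accepts both positions inside a single pair of brackets, separated by a comma. The sketch below assumes the same `student_data` values as above; the names `first_hours`, `all_grades`, and `first_student` are hypothetical and used only here.

```python
import numpy as np

student_data = np.array([[6, 8, 11, 6, 9],
                         [65, 81, 97, 77, 91]])

# [row, column] in one step; same element as student_data[0][0]
first_hours = student_data[0, 0]

# A colon selects everything along that dimension
all_grades = student_data[1, :]      # the whole second row (grades)
first_student = student_data[:, 0]   # first column: hours and grade of one student

print(first_hours)     # 6
print(all_grades)      # [65 81 97 77 91]
print(first_student)   # [ 6 65]
```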

<div class="alert alert-block alert-warning">
    <b>Task:</b> 
    Print the <b>grade</b> corresponding to the 3rd <b>study_hours</b> value.
    <br>
    <b>Output</b> : It should show 97
</div>

In [12]:
# write your code here
student_data[1][2]

97

Now you have a multidimensional array containing both the students' study time and grade information, which you can use to compare data. For example, how does the mean study time compare to the mean grade?

In [13]:
# Get the mean value of each sub-array
avg_study = student_data[0].mean()
avg_grade = student_data[1].mean()

print('Average study hours: {:.2f}\nAverage grade: {:.2f}'.format(avg_study, avg_grade))

Average study hours: 8.00
Average grade: 82.20
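
The two means can also be computed in one call by passing `axis=1`, which aggregates across each row of the 2D array. This is a sketch under the assumption that the array holds the same values as above; `row_means`, `row_min`, and `row_max` are illustrative names.

```python
import numpy as np

student_data = np.array([[6, 8, 11, 6, 9],
                         [65, 81, 97, 77, 91]])

# axis=1 collapses the columns, leaving one value per row
row_means = student_data.mean(axis=1)   # [mean study hours, mean grade]
row_min = student_data.min(axis=1)
row_max = student_data.max(axis=1)

print(row_means)          # [ 8.  82.2]
print(row_min, row_max)   # [ 6 65] [11 97]
```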


## Exploring tabular data with Pandas

NumPy provides a lot of the functionality you need to work with numbers, and specifically arrays of numeric values. When you start to deal with two-dimensional tables of data, however, the **Pandas** package offers a more convenient structure to work with: the **DataFrame**.

A **DataFrame** is a data structure from the `pandas` library that holds data in a 2D (row and column) format. The library provides tools for working with `DataFrames` so that you can manipulate and visualize the data as required.

Run the following cell to import the Pandas library and create a DataFrame with three columns. The first column is a list of `student names`, and the second and third columns are the NumPy arrays containing the `study time` and `grade data`.

In [14]:
import pandas as pd

df_students = pd.DataFrame({'name': ['Dan', 'Anila', 'Pedro', 'Rosie', 'Ethan'],
                            'study_hours':student_data[0],
                            'grade':student_data[1]})

df_students 

| |name|study_hours|grade|
|---|---|---|---|
|0|Dan|6|65|
|1|Anila|8|81|
|2|Pedro|11|97|
|3|Rosie|6|77|
|4|Ethan|9|91|


<div class="alert alert-block alert-info">
    <b>Note:</b><br>
    In addition to the columns you specified, the DataFrame includes an <b>index</b> to uniquely identify each row. We could have specified the index explicitly and assigned any kind of appropriate value (for example, an email address); because we didn't specify an index, one has been created with a unique integer value for each row.
</div>
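
To make the note concrete, here is a minimal sketch of a DataFrame with an explicit index. Using the student names as the index is an assumption made for illustration, and `df_indexed` is a hypothetical name; the original notebook keeps the default integer index.

```python
import pandas as pd

# Same data as before, but the names become the row labels
df_indexed = pd.DataFrame(
    {'study_hours': [6, 8, 11, 6, 9],
     'grade': [65, 81, 97, 77, 91]},
    index=['Dan', 'Anila', 'Pedro', 'Rosie', 'Ethan'],
)

print(df_indexed.index.tolist())   # ['Dan', 'Anila', 'Pedro', 'Rosie', 'Ethan']
print(df_indexed.shape)            # (5, 2): the index is not counted as a column
```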


### Finding and filtering data in a DataFrame

You can use the DataFrame's **loc** method to retrieve data for a specific index value, like this.

In [15]:
grades[0:2]

array([65, 81])

In [16]:
# Get the data for index value 3
df_students.loc[3]

name           Rosie
study_hours        6
grade             77
Name: 3, dtype: object

In [17]:
# Get the rows with index values from 0 to 2
df_students.loc[0:2]

| |name|study_hours|grade|
|---|---|---|---|
|0|Dan|6|65|
|1|Anila|8|81|
|2|Pedro|11|97|


In addition to being able to use the **loc** method to find rows based on the index, you can use the **iloc** method to find rows based on their ordinal position in the DataFrame (regardless of the index):

In [18]:
# Get data in the first two rows
df_students.iloc[0:2]

| |name|study_hours|grade|
|---|---|---|---|
|0|Dan|6|65|
|1|Anila|8|81|


Look carefully at the `iloc[0:2]` results, and compare them to the `loc[0:2]` results you obtained previously. Can you spot the difference?


The **loc** method returned rows with index *label* in the list of values from *0* to *2* - which includes *0*, *1*, and *2* (`three rows`). However, the **iloc** method returns the rows in the `positions` included in the range 0 to 2, and since integer ranges don't include the upper-bound value, this includes positions *0*, and *1* (`two rows`).
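
The distinction is easier to see when the index labels differ from the row positions. The sketch below uses a made-up index of 10, 20, ... (an assumption for illustration only; `df`, `by_label`, and `by_position` are hypothetical names):

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['Dan', 'Anila', 'Pedro', 'Rosie', 'Ethan'],
     'grade': [65, 81, 97, 77, 91]},
    index=[10, 20, 30, 40, 50],   # labels no longer match positions
)

# loc slices by label and includes the end label
by_label = df.loc[10:30]

# iloc slices by position and excludes the end position
by_position = df.iloc[0:2]

print(by_label['name'].tolist())      # ['Dan', 'Anila', 'Pedro']
print(by_position['name'].tolist())   # ['Dan', 'Anila']
```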

**iloc** identifies data values in a DataFrame by *position*, which extends beyond rows to columns. So for example, you can use it to find the values for the columns in positions 1 and 2 in row 0, like this:

In [19]:
df_students.iloc[0,[1,2]]

study_hours     6
grade          65
Name: 0, dtype: object

<div class="alert alert-block alert-warning">
    <b>Task:</b> 
    Print the <b>name</b> and <b>grade</b> of the student named "Anila".
    <br>
</div>

In [20]:
# write your code here
df_students.iloc[1,[0,2]]

name     Anila
grade       81
Name: 1, dtype: object

The **loc** method is used to locate data items based on index labels rather than positions. In the absence of an explicit index column, the rows in our DataFrame are indexed by integer values, but the columns are identified by name:

In [21]:
df_students.loc[0,['name','grade']]

name     Dan
grade     65
Name: 0, dtype: object

#### Different methods to `query`

You can use the **loc** method to find indexed rows based on a filtering expression that references named columns other than the index, like this:

In [22]:
df_students.loc[df_students['name']=='Rosie']

| |name|study_hours|grade|
|---|---|---|---|
|3|Rosie|6|77|


Actually, you don't need to explicitly use the **loc** method to do this - you can simply apply a DataFrame filtering expression, like this:

In [23]:
df_students[df_students['name']=='Rosie']

| |name|study_hours|grade|
|---|---|---|---|
|3|Rosie|6|77|


You can achieve the same results by using the DataFrame's **query** method, like this:

In [24]:
df_students.query('name=="Rosie"')

| |name|study_hours|grade|
|---|---|---|---|
|3|Rosie|6|77|


Pandas often offers several ways to achieve the same result. Another example is the way you refer to a DataFrame column. You can specify the column name as a named index value (as in the `df_students['name']` examples we've seen so far), or you can use the column as a property of the DataFrame, like this:

In [27]:
df_students[df_students.name == 'Rosie']

| |name|study_hours|grade|
|---|---|---|---|
|3|Rosie|6|77|
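
The filtering styles shown in this section also work with numeric conditions, and conditions can be combined. The following sketch rebuilds the same DataFrame so that it runs on its own; the thresholds and the names `mask_result` and `query_result` are chosen purely for illustration.

```python
import pandas as pd

df_students = pd.DataFrame({'name': ['Dan', 'Anila', 'Pedro', 'Rosie', 'Ethan'],
                            'study_hours': [6, 8, 11, 6, 9],
                            'grade': [65, 81, 97, 77, 91]})

# Boolean filtering: wrap each condition in parentheses and combine with &
mask_result = df_students[(df_students['study_hours'] > 6) &
                          (df_students['grade'] >= 90)]

# The same filter as a query string, where 'and' joins the conditions
query_result = df_students.query('study_hours > 6 and grade >= 90')

print(mask_result['name'].tolist())       # ['Pedro', 'Ethan']
print(mask_result.equals(query_result))   # True
```

Note that the boolean form needs `&` (not the Python keyword `and`) because the comparison is evaluated element by element.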


<div class="alert alert-block alert-info">
    <b>Note:</b><br>
    <i>axis=0</i> means operate along the rows, producing one result for each column.
    <br>
    <i>axis=1</i> means operate along the columns, producing one result for each row.
</div>

Here is how `axis` operates 

![alt text](images/df_axis.png "Title")

In [28]:
# Sum along the columns for each row (numeric columns only,
# since text and numbers cannot be added together)
print(df_students.sum(axis=1, numeric_only=True))

# Sum along the rows for each column
print(df_students.sum(axis=0))

0     71
1     89
2    108
3     83
4    100
dtype: int64
name           DanAnilaPedroRosieEthan
study_hours                         40
grade                              411
dtype: object
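
Summing the `name` column simply concatenates the strings, which is rarely what you want. One possible approach, sketched below with the same data, is to select the numeric columns first and then aggregate; the names `numeric`, `col_means`, and `row_totals` are hypothetical.

```python
import pandas as pd

df_students = pd.DataFrame({'name': ['Dan', 'Anila', 'Pedro', 'Rosie', 'Ethan'],
                            'study_hours': [6, 8, 11, 6, 9],
                            'grade': [65, 81, 97, 77, 91]})

# Keep only the numeric columns before aggregating
numeric = df_students[['study_hours', 'grade']]

col_means = numeric.mean(axis=0)    # one value per column
row_totals = numeric.sum(axis=1)    # one value per row

print(col_means)
print(row_totals.tolist())          # [71, 89, 108, 83, 100]
```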
