## Learning ML data exploration with Python

Data scientists explore, analyze, and visualize data using various tools, with **Python** and Jupyter notebooks being among the most popular. Python's flexibility and extensive libraries make it ideal for data science and machine learning.

In [62]:
# Dependencies Install cell
%pip install numpy
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


## Exploring data arrays with NumPy

Let's start by looking at some simple data. Suppose a college takes a sample of student grades for a data science class.

In [63]:
import random

# Generate 30 random student grades (0 to 100) and store them in a list
data = [random.randint(0, 100) for _ in range(30)]

print(data)

[94, 32, 81, 77, 75, 51, 50, 80, 61, 34, 32, 30, 66, 44, 64, 53, 8, 1, 13, 35, 40, 34, 3, 52, 73, 34, 78, 13, 66, 51]


In [64]:
# Create a NumPy array from the data, because a list is not optimized for numeric analysis
import numpy as np

# Convert the data list into a NumPy array, a numeric data structure optimized for mathematical operations
grades = np.array(data)

# see the difference between a list and numpy array
print(data)
print(grades)

[94, 32, 81, 77, 75, 51, 50, 80, 61, 34, 32, 30, 66, 44, 64, 53, 8, 1, 13, 35, 40, 34, 3, 52, 73, 34, 78, 13, 66, 51]
[94 32 81 77 75 51 50 80 61 34 32 30 66 44 64 53  8  1 13 35 40 34  3 52
 73 34 78 13 66 51]


In [65]:
# See the difference when we perform a multiplication operation
print(type(data), 'x 2:', data * 2)
print(type(grades), 'x 2:', grades * 2)


<class 'list'> x 2: [94, 32, 81, 77, 75, 51, 50, 80, 61, 34, 32, 30, 66, 44, 64, 53, 8, 1, 13, 35, 40, 34, 3, 52, 73, 34, 78, 13, 66, 51, 94, 32, 81, 77, 75, 51, 50, 80, 61, 34, 32, 30, 66, 44, 64, 53, 8, 1, 13, 35, 40, 34, 3, 52, 73, 34, 78, 13, 66, 51]
<class 'numpy.ndarray'> x 2: [188  64 162 154 150 102 100 160 122  68  64  60 132  88 128 106  16   2
  26  70  80  68   6 104 146  68 156  26 132 102]


Multiplying a list by 2 creates a new list of twice the length with the original sequence of list elements repeated.

Multiplying a NumPy array, on the other hand, performs an element-wise calculation in which the array behaves like a vector. The result is an array of the same size in which each element has been multiplied by 2.
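The same contrast can be reproduced on a much smaller input. The sketch below uses three made-up values; the names `sample`, `sample_array`, `repeated`, `scaled`, and `scaled_list` are illustrative and do not appear elsewhere in this notebook.

```python
import numpy as np

# Hypothetical three-element sample, chosen only for illustration
sample = [10, 20, 30]
sample_array = np.array(sample)

# Multiplying a list repeats its sequence of elements
repeated = sample * 2

# Multiplying an array scales every element
scaled = sample_array * 2

# Scaling a plain list needs an explicit loop or comprehension
scaled_list = [value * 2 for value in sample]

print(repeated)     # [10, 20, 30, 10, 20, 30]
print(scaled)       # [20 40 60]
print(scaled_list)  # [20, 40, 60]
```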

In [66]:
# Shape of numpy array
print(grades.shape)

(30,)


In [67]:
# Add a second dimension to the grades data

# Generate random study hours (1 to 24) for the same 30 students
study_hours = [random.randint(1, 24) for _ in range(30)]
print(study_hours)


[18, 4, 21, 8, 5, 1, 23, 14, 21, 6, 1, 10, 16, 23, 23, 10, 8, 24, 8, 4, 1, 7, 22, 22, 12, 14, 15, 17, 22, 1]


In [68]:
# Create a 2D array (an array of arrays)
student_data = np.array([study_hours, grades])
student_data

array([[18,  4, 21,  8,  5,  1, 23, 14, 21,  6,  1, 10, 16, 23, 23, 10,
         8, 24,  8,  4,  1,  7, 22, 22, 12, 14, 15, 17, 22,  1],
       [94, 32, 81, 77, 75, 51, 50, 80, 61, 34, 32, 30, 66, 44, 64, 53,
         8,  1, 13, 35, 40, 34,  3, 52, 73, 34, 78, 13, 66, 51]])

In [69]:
print(student_data.shape)

(2, 30)


In [70]:
# 2d array
print(student_data)
# first array
print(student_data[0])
# second array
print(student_data[1])
# first array - first element
print(student_data[0][0])
# second array - first element
print(student_data[1][0])

[[18  4 21  8  5  1 23 14 21  6  1 10 16 23 23 10  8 24  8  4  1  7 22 22
  12 14 15 17 22  1]
 [94 32 81 77 75 51 50 80 61 34 32 30 66 44 64 53  8  1 13 35 40 34  3 52
  73 34 78 13 66 51]]
[18  4 21  8  5  1 23 14 21  6  1 10 16 23 23 10  8 24  8  4  1  7 22 22
 12 14 15 17 22  1]
[94 32 81 77 75 51 50 80 61 34 32 30 66 44 64 53  8  1 13 35 40 34  3 52
 73 34 78 13 66 51]
18
94


In [71]:
# Compute a summary statistic on each row:
# the mean of the study hours (row 0) and of the grades (row 1)
avg_study = student_data[0].mean()
avg_grade = student_data[1].mean()

print('Average study hours: {:.2f}\nAverage grade: {:.2f}'.format(avg_study, avg_grade))

Average study hours: 12.70
Average grade: 47.50
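Both row means can also be computed in one call by passing `axis=1` to `mean()`. The following is a minimal sketch on a small made-up array; `small_data` and `row_means` are hypothetical names used only here.

```python
import numpy as np

# Hypothetical 2 x 4 array: row 0 holds study hours, row 1 holds grades
small_data = np.array([[10, 12, 8, 14],
                       [50, 60, 40, 70]])

# axis=1 averages along each row, producing one mean per row
row_means = small_data.mean(axis=1)

print(row_means)  # [11. 55.]
```

Applied to `student_data`, the same call would return the average study hours and the average grade as a two-element array.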


## Exploring tabular data with Pandas

NumPy provides much of the functionality you need to work with numbers, and specifically with arrays of numeric values. When you start to deal with two-dimensional tables of data, however, the **Pandas** package offers a more convenient structure to work with: the **DataFrame**.

In [72]:
# Create a students dataframe
import pandas as pd

# Define student names
students_names = [
    'Jakeem', 'Helena', 'Ismat', 'Anila', 'Skye',
    'Daniel', 'Aisha', 'Liam', 'Noah', 'Elijah',
    'James', 'William', 'Benjamin', 'Lucas', 'Mason',
    'Oliver', 'Evelyn', 'Abigail', 'Emily', 'Harper',
    'Amelia', 'Ava', 'Sophia', 'Mia', 'Isabella',
    'Charlotte', 'Gianna', 'Saif', 'Asif', 'Jawad'
]

# print(len(students_names))
# Create a dataframe of students
df_students = pd.DataFrame(
    {
        'Name': students_names,
        'StudyHours': student_data[0],
        'Grade': student_data[1]
    }
)

df_students.head(6)

     Name  StudyHours  Grade
0  Jakeem          18     94
1  Helena           4     32
2   Ismat          21     81
3   Anila           8     77
4    Skye           5     75
5  Daniel           1     51


## Finding and filtering data in a DataFrame

In [73]:
# Get the data for index 5
print(df_students.loc[5])

Name          Daniel
StudyHours         1
Grade             51
Name: 5, dtype: object


In [74]:
# Get the rows with index labels 0 to 5 (loc slices by label and includes both ends)
print(df_students.loc[0:5])

     Name  StudyHours  Grade
0  Jakeem          18     94
1  Helena           4     32
2   Ismat          21     81
3   Anila           8     77
4    Skye           5     75
5  Daniel           1     51


In [75]:
# Get the data in the first five rows (index 0 to 4)
print(df_students.iloc[0:5])

     Name  StudyHours  Grade
0  Jakeem          18     94
1  Helena           4     32
2   Ismat          21     81
3   Anila           8     77
4    Skye           5     75
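With the default integer index, labels and positions coincide, so the difference between `loc` and `iloc` is easy to miss. The sketch below uses a small made-up frame whose index labels are not 0, 1, 2, ...; `df_demo`, `by_label`, and `by_position` are hypothetical names.

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions
df_demo = pd.DataFrame(
    {'Name': ['A', 'B', 'C', 'D'], 'Grade': [70, 55, 90, 62]},
    index=[10, 20, 30, 40],
)

# loc slices by label and includes the end label: rows labelled 10, 20 and 30
by_label = df_demo.loc[10:30]

# iloc slices by position and excludes the end position: rows at positions 0 and 1
by_position = df_demo.iloc[0:2]

print(by_label)
print(by_position)
```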


In [76]:
# Get the values from the columns in positions 1 and 2 of row 0
print(df_students.iloc[0, [1, 2]])

StudyHours    18
Grade         94
Name: 0, dtype: object


In [77]:
# Get the StudyHours and Grade values for the row with index label 0
print(df_students.loc[0, "StudyHours"])
print(df_students.loc[0,'Grade'])

18
94


In [78]:
# Find by name
df_students.loc[df_students['Name']=='Aisha']

    Name  StudyHours  Grade
6  Aisha          23     50


In [79]:
df_students[df_students['Name']=='Aisha']

    Name  StudyHours  Grade
6  Aisha          23     50


In [80]:
# Use query to filter
print(df_students.query('Name=="Aisha"'))

    Name  StudyHours  Grade
6  Aisha          23     50


In [81]:
print(df_students[df_students.Name == 'Aisha'])

    Name  StudyHours  Grade
6  Aisha          23     50
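The four filters above return the same row. The same approaches extend to several conditions at once; the sketch below, on a small made-up frame, shows a boolean mask combined with `&` next to the equivalent `query` expression. The frame `df_demo` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame, used only to illustrate combined filters
df_demo = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cy'],
    'StudyHours': [12, 5, 15],
    'Grade': [64, 40, 58],
})

# Each condition needs its own parentheses; & is the element-wise AND
mask = (df_demo['StudyHours'] > 10) & (df_demo['Grade'] >= 60)
selected = df_demo[mask]

# The same filter written as a query string
selected_query = df_demo.query('StudyHours > 10 and Grade >= 60')

print(selected)
```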


### Loading a DataFrame from a file

In [82]:
df_students = pd.read_csv('./../../data/grades.csv', delimiter=',', header=0)
df_students.head()

    Name  StudyHours  Grade
0    Dan       10.00   50.0
1  Joann       11.50   50.0
2  Pedro        9.00   47.0
3  Rosie       16.00   97.0
4  Ethan        9.25   49.0
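The contents of `grades.csv` are not shown here, so the sketch below reads a few made-up rows from an in-memory string instead. It illustrates that `read_csv` uses the first line as the header by default and turns empty fields into `NaN`. The names `csv_text` and `df_demo`, and the rows themselves, are assumptions.

```python
import io

import pandas as pd

# Hypothetical CSV text; the empty fields stand for missing values
csv_text = (
    "Name,StudyHours,Grade\n"
    "Ann,10.0,50.0\n"
    "Bob,8.0,\n"
    "Cy,,\n"
)

# read_csv accepts a file path or any file-like object
df_demo = pd.read_csv(io.StringIO(csv_text))

print(df_demo)
print(df_demo.isnull().sum())
```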


### Handling missing values

In [83]:
# Count the missing values in each column
df_students.isnull().sum()

Name          0
StudyHours    1
Grade         2
dtype: int64

In [84]:
# Filter the DataFrame to include only rows where any column is null
df_students[df_students.isnull().any(axis=1)]

    Name  StudyHours  Grade
22  Bill         8.0    NaN
23   Ted         NaN    NaN


### Imputing missing values

In [85]:
# Fill Ted's missing study hours with the mean of the StudyHours column
df_students.StudyHours = df_students.StudyHours.fillna(df_students.StudyHours.mean())
df_students[df_students.Name=='Ted']

    Name  StudyHours  Grade
23   Ted   10.413043    NaN
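`mean()` skips missing values by default, so the fill value is the average of the known entries only. The mean is one possible choice; the median is a common alternative when a few extreme values would pull the mean. A minimal sketch with made-up numbers (`hours`, `mean_fill`, and `median_fill` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical study hours with one missing value and one extreme value
hours = pd.Series([2.0, 4.0, np.nan, 30.0])

# mean() ignores NaN: (2 + 4 + 30) / 3 = 12
mean_fill = hours.fillna(hours.mean())

# median() also ignores NaN: the middle of 2, 4, 30 is 4
median_fill = hours.fillna(hours.median())

print(mean_fill.iloc[2])    # 12.0
print(median_fill.iloc[2])  # 4.0
```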


In [86]:
# Dropping rows with missing values
df_students = df_students.dropna(axis=0, how='any')
# Confirm that no rows with missing values remain
df_students[df_students.isnull().any(axis=1)]

Empty DataFrame
Columns: [Name, StudyHours, Grade]
Index: []
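`dropna` can also be restricted to particular columns through its `subset` parameter, which keeps rows whose missing values are in columns that do not matter for the analysis. The sketch below uses a made-up frame; `df_demo`, `grade_known`, and `complete` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with gaps in different columns
df_demo = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cy'],
    'StudyHours': [8.0, np.nan, 6.0],
    'Grade': [np.nan, 55.0, 70.0],
})

# Drop only the rows where Grade is missing
grade_known = df_demo.dropna(subset=['Grade'])

# Drop the rows where any column is missing (the default behaviour)
complete = df_demo.dropna()

print(grade_known)
print(complete)
```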


## Explore data in a DataFrame

In [87]:
# Get the mean study hours, selecting the column by name with brackets
mean_study = df_students['StudyHours'].mean()
mean_study

np.float64(10.522727272727273)

In [88]:
# Get the mean grade, accessing the column as an attribute
mean_grade = df_students.Grade.mean()
mean_grade

np.float64(49.18181818181818)

In [89]:
print('Average weekly study hours: {:.2f}'.format(mean_study))
print('Average grade: {:.2f}'.format(mean_grade))

Average weekly study hours: 10.52
Average grade: 49.18


In [90]:
# Filter students that studied more than average
df_students[df_students.StudyHours > mean_study]

         Name  StudyHours  Grade
1       Joann       11.50   50.0
3       Rosie       16.00   97.0
6    Frederic       11.50   53.0
9    Giovanni       14.50   74.0
10  Francesca       15.50   82.0
11      Rajab       13.75   62.0
14      Jenny       15.50   70.0
19       Skye       12.00   52.0
20     Daniel       12.50   63.0
21      Aisha       12.00   64.0


In [91]:
# Mean grade of the students who studied more than average
df_students[df_students.StudyHours > mean_study].Grade.mean()

np.float64(66.7)

### Feature Engineering

In [92]:
# Add a Pass column marking students with a grade of 60 or higher
# Create a pandas Series for the new column
passes = pd.Series(df_students['Grade'] >= 60)
# Add the column to data
df_students = pd.concat([df_students, passes.rename('Pass')], axis=1)
df_students

       Name  StudyHours  Grade   Pass
0       Dan       10.00   50.0  False
1     Joann       11.50   50.0  False
2     Pedro        9.00   47.0  False
3     Rosie       16.00   97.0   True
4     Ethan        9.25   49.0  False
5     Vicky        1.00    3.0  False
6  Frederic       11.50   53.0  False
7    Jimmie        9.00   42.0  False
8    Rhonda        8.50   26.0  False
9  Giovanni       14.50   74.0   True
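A derived column like this can also be added by direct assignment, which avoids building a separate Series and concatenating it. This is an alternative sketch on a made-up frame, not the approach used in the cell above; `df_demo` and its values are assumptions.

```python
import pandas as pd

# Hypothetical frame, used only to illustrate direct column assignment
df_demo = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Cy'],
    'Grade': [50.0, 97.0, 60.0],
})

# The comparison yields a boolean Series aligned on the index;
# assigning it to a new column name adds that column in place
df_demo['Pass'] = df_demo['Grade'] >= 60

print(df_demo)
```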


#### Data Analytics

In [93]:
# Count the number of students who passed
number_passed = df_students.groupby(df_students.Pass).Name.count()
number_passed

Pass
False    15
True      7
Name: Name, dtype: int64

In [95]:
# Mean study hours and mean grade, grouped by pass status
print(df_students.groupby(df_students.Pass)[['StudyHours', 'Grade']].mean())

       StudyHours      Grade
Pass                        
False    8.783333  38.000000
True    14.250000  73.142857
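The count and the means can be gathered into a single summary table with named aggregation in `agg`. The sketch below runs on a made-up frame; `df_demo`, `summary`, and the output column names `students`, `mean_hours`, and `mean_grade` are hypothetical.

```python
import pandas as pd

# Hypothetical frame with an existing Pass column
df_demo = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'StudyHours': [8.0, 10.0, 14.0, 16.0],
    'Grade': [40.0, 50.0, 70.0, 90.0],
    'Pass': [False, False, True, True],
})

# Each keyword names an output column as (source column, aggregation)
summary = df_demo.groupby('Pass').agg(
    students=('Name', 'count'),
    mean_hours=('StudyHours', 'mean'),
    mean_grade=('Grade', 'mean'),
)

print(summary)
```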
