## Learning ML data exploration with Python

Data scientists explore, analyze, and visualize data using various tools, with **Python** and Jupyter notebooks being among the most popular. Python's flexibility and extensive libraries make it ideal for data science and machine learning.

In [2]:
# Install dependencies
%pip install numpy
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


## Exploring data arrays with NumPy

Let's start by looking at some simple data. Suppose a college takes a sample of student grades for a data science class.

In [3]:
import random

# Create 30 random student grades and store them in a list
data = [random.randint(0, 100) for _ in range(30)]

print(data)

[51, 34, 90, 1, 47, 20, 56, 81, 51, 30, 69, 45, 78, 23, 35, 18, 80, 41, 48, 8, 83, 11, 43, 42, 71, 30, 51, 92, 97, 49]
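
Because the grades are generated randomly, the values printed here will differ on every run. A minimal sketch, assuming reproducible output is wanted: seeding the generator makes the same list come back each time. The helper `make_grades` and the seed value `42` are illustrative choices, not part of the original notebook.

```python
import random

def make_grades(seed, n=30):
    """Return n pseudo-random grades between 0 and 100 for a given seed."""
    rng = random.Random(seed)  # independent generator, seeded explicitly
    return [rng.randint(0, 100) for _ in range(n)]

first = make_grades(42)
second = make_grades(42)

# The same seed always produces the same sequence
print(first == second)  # True
```

Calling `random.seed(42)` once at the top of the notebook would have a similar effect on the module-level functions used in the cell above.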


In [4]:
# Create a NumPy array from the data, since a list is not optimized for numeric analysis
import numpy as np

# Convert the list into a NumPy array, a numeric data structure optimized for mathematical operations
grades = np.array(data)

# See the difference between a list and a NumPy array
print(data)
print(grades)

[51, 34, 90, 1, 47, 20, 56, 81, 51, 30, 69, 45, 78, 23, 35, 18, 80, 41, 48, 8, 83, 11, 43, 42, 71, 30, 51, 92, 97, 49]
[51 34 90  1 47 20 56 81 51 30 69 45 78 23 35 18 80 41 48  8 83 11 43 42
 71 30 51 92 97 49]


In [5]:
# See the difference when we perform a multiplication operation
print(type(data), 'x 2:', data * 2)
print(type(grades), 'x 2:', grades * 2)


<class 'list'> x 2: [51, 34, 90, 1, 47, 20, 56, 81, 51, 30, 69, 45, 78, 23, 35, 18, 80, 41, 48, 8, 83, 11, 43, 42, 71, 30, 51, 92, 97, 49, 51, 34, 90, 1, 47, 20, 56, 81, 51, 30, 69, 45, 78, 23, 35, 18, 80, 41, 48, 8, 83, 11, 43, 42, 71, 30, 51, 92, 97, 49]
<class 'numpy.ndarray'> x 2: [102  68 180   2  94  40 112 162 102  60 138  90 156  46  70  36 160  82
  96  16 166  22  86  84 142  60 102 184 194  98]


Multiplying a list by 2 creates a new list of twice the length, with the original sequence of elements repeated.

Multiplying a NumPy array, on the other hand, performs an element-wise calculation in which the array behaves like a vector, so we end up with an array of the same size in which each element has been multiplied by 2.
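
The contrast is easier to see on a small, fixed sample. The sketch below uses three made-up grades (not the notebook's random data) and also shows what a plain list needs, a comprehension, to get the element-wise result.

```python
import numpy as np

# Small illustrative sample, not the notebook's random grades
sample = [50, 75, 100]
sample_arr = np.array(sample)

# List: * repeats the whole sequence
repeated = sample * 2

# List: doubling each element needs an explicit comprehension
doubled_list = [g * 2 for g in sample]

# Array: * is applied to every element
doubled_arr = sample_arr * 2

print(repeated)      # [50, 75, 100, 50, 75, 100]
print(doubled_list)  # [100, 150, 200]
print(doubled_arr)   # [100 150 200]
```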

In [6]:
# Shape of the NumPy array (30 elements in one dimension)
print(grades.shape)

(30,)


In [7]:
# Add a second dimension to the grades data

# Define the study hours for each student
study_hours = [random.randint(1, 24) for _ in range(30)]
print(study_hours)


[16, 12, 12, 2, 8, 1, 5, 8, 6, 19, 9, 9, 8, 9, 3, 19, 24, 3, 5, 18, 15, 24, 23, 20, 12, 19, 3, 11, 1, 11]


In [8]:
# Create a 2D array (an array of arrays)
student_data = np.array([study_hours, grades])
student_data

array([[16, 12, 12,  2,  8,  1,  5,  8,  6, 19,  9,  9,  8,  9,  3, 19,
        24,  3,  5, 18, 15, 24, 23, 20, 12, 19,  3, 11,  1, 11],
       [51, 34, 90,  1, 47, 20, 56, 81, 51, 30, 69, 45, 78, 23, 35, 18,
        80, 41, 48,  8, 83, 11, 43, 42, 71, 30, 51, 92, 97, 49]])

In [9]:
print(student_data.shape)

(2, 30)


In [10]:
# The full 2D array
print(student_data)
# First row (study hours)
print(student_data[0])
# Second row (grades)
print(student_data[1])
# First row, first element
print(student_data[0][0])
# Second row, first element
print(student_data[1][0])

[[16 12 12  2  8  1  5  8  6 19  9  9  8  9  3 19 24  3  5 18 15 24 23 20
  12 19  3 11  1 11]
 [51 34 90  1 47 20 56 81 51 30 69 45 78 23 35 18 80 41 48  8 83 11 43 42
  71 30 51 92 97 49]]
[16 12 12  2  8  1  5  8  6 19  9  9  8  9  3 19 24  3  5 18 15 24 23 20
 12 19  3 11  1 11]
[51 34 90  1 47 20 56 81 51 30 69 45 78 23 35 18 80 41 48  8 83 11 43 42
 71 30 51 92 97 49]
16
51
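
The chained indexing used above (`[0][0]`) can also be written as a single `(row, column)` index, and a slice can pull out one column, that is, one student's values. A short sketch on a hypothetical 2 x 3 array (the name `sample_data` and its values are made up for illustration):

```python
import numpy as np

# Hypothetical 2 x 3 array: row 0 = study hours, row 1 = grades
sample_data = np.array([[10, 4, 7],
                        [72, 38, 55]])

chained = sample_data[0][0]        # index the row, then the element
direct = sample_data[0, 0]         # same element with one (row, column) index
first_student = sample_data[:, 0]  # column 0: hours and grade for one student

print(chained, direct)  # 10 10
print(first_student)    # [10 72]
```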


In [11]:
# Apply an aggregate operation:
# get the mean value of each sub-array
avg_study = student_data[0].mean()
avg_grade = student_data[1].mean()

print('Average study hours: {:.2f}\nAverage grade: {:.2f}'.format(avg_study, avg_grade))

Average study hours: 11.17
Average grade: 49.17
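
As an alternative to averaging each sub-array separately, NumPy can compute both means in one call by averaging along an axis. A minimal sketch on a hypothetical 2 x 4 array (`sample_data` and its values are illustrative, not the notebook's data):

```python
import numpy as np

# Hypothetical 2 x 4 array: row 0 = study hours, row 1 = grades
sample_data = np.array([[10, 12, 8, 14],
                        [50, 70, 40, 80]])

# axis=1 averages across the columns, giving one mean per row
row_means = sample_data.mean(axis=1)

print(row_means)  # [11. 60.]
```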


## Exploring tabular data with Pandas

NumPy provides much of the functionality you need to work with numbers, and specifically with arrays of numeric values. When you start to deal with two-dimensional tables of data, however, the **Pandas** package offers a more convenient structure to work with: the **DataFrame**.
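
To make the idea concrete, here is a small sketch of a DataFrame built from made-up values (`df_sample` and its contents are hypothetical). Each column has a name and its own data type, so text and numbers can sit side by side and be selected by label.

```python
import pandas as pd

# Hypothetical three-student table; names and values are illustrative
df_sample = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cho'],
    'StudyHours': [10, 4, 7],
    'Grade': [72, 38, 55],
})

# Columns are selected by name and support NumPy-style aggregates
grade_mean = df_sample['Grade'].mean()

print(df_sample.shape)  # (3, 3)
print(grade_mean)       # 55.0
```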

In [15]:
# Create a students dataframe
import pandas as pd

# Define student names
students_names = ['Jakeem', 'Helena', 'Ismat', 'Anila', 'Skye',
                  'Daniel', 'Aisha', 'Liam', 'Noah', 'Elijah',
                  'James', 'William', 'Benjamin', 'Lucas', 'Mason',
                  'Oliver', 'Evelyn', 'Abigail', 'Emily', 'Harper',
                  'Amelia', 'Ava', 'Sophia', 'Mia', 'Isabella',
                  'Charlotte', 'Gianna', 'Saif', 'Asif', 'Jawad'
                  ]

# print(len(students_names))
# Create a dataframe of students
df_students = pd.DataFrame(
    {
        'Name': students_names,
        'StudyHours': student_data[0],
        'Grade': student_data[1]
    }
)

df_students.head(6)

     Name  StudyHours  Grade
0  Jakeem          16     51
1  Helena          12     34
2   Ismat          12     90
3   Anila           2      1
4    Skye           8     47
5  Daniel           1     20


## Finding and filtering data in a DataFrame

In [13]:
# Get the row whose index label is 5
print(df_students.loc[5])

Name          Daniel
StudyHours         1
Grade             20
Name: 5, dtype: object


In [16]:
# Get the rows with index labels from 0 to 5 (loc includes both endpoints)
print(df_students.loc[0:5])

     Name  StudyHours  Grade
0  Jakeem          16     51
1  Helena          12     34
2   Ismat          12     90
3   Anila           2      1
4    Skye           8     47
5  Daniel           1     20
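
Note that `loc[0:5]` returned six rows: `loc` selects by index label and includes the end label. The sketch below contrasts this with position-based `iloc`, which follows the usual Python convention of excluding the end position. The frame `df_sample` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical four-student table with the default 0..3 index
df_sample = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cho', 'Dev'],
    'Grade': [72, 38, 55, 91],
})

by_label = df_sample.loc[0:2]      # labels 0, 1 and 2 (end label included)
by_position = df_sample.iloc[0:2]  # positions 0 and 1 (end position excluded)

print(len(by_label), len(by_position))  # 3 2
```

With the default integer index the two look similar, which makes the off-by-one difference easy to miss; with a custom index the labels and positions no longer coincide at all.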
