## Learning ML data exploration with Python

Data scientists explore, analyze, and visualize data using various tools, with **Python** and Jupyter notebooks being among the most popular. Python's flexibility and extensive libraries make it ideal for data science and machine learning.

In [3]:
# Install dependencies
%pip install numpy
%pip install pandas

Note: you may need to restart the kernel to use updated packages.


## Exploring data arrays with NumPy

Let's start by looking at some simple data. Suppose a college takes a sample of student grades for a data science class.

In [4]:
import random

# Generate 30 random student grades and store them in a list
data = [random.randint(0, 100) for _ in range(30)]

print(data)

[99, 68, 9, 19, 31, 10, 100, 60, 62, 51, 91, 24, 37, 86, 52, 65, 74, 10, 21, 65, 17, 51, 74, 77, 2, 50, 26, 34, 95, 82]


In [5]:
# A plain list is not optimized for numeric analysis, so use a NumPy array instead
import numpy as np

# Convert the data list into a NumPy array, a numeric data structure optimized for mathematical operations
grades = np.array(data)

# See the difference between a list and a NumPy array
print(data)
print(grades)

[99, 68, 9, 19, 31, 10, 100, 60, 62, 51, 91, 24, 37, 86, 52, 65, 74, 10, 21, 65, 17, 51, 74, 77, 2, 50, 26, 34, 95, 82]
[ 99  68   9  19  31  10 100  60  62  51  91  24  37  86  52  65  74  10
  21  65  17  51  74  77   2  50  26  34  95  82]


In [6]:
# See the difference when we perform a multiplication operation
print(type(data), 'x 2:', data * 2)
print(type(grades), 'x 2:', grades * 2)


<class 'list'> x 2: [99, 68, 9, 19, 31, 10, 100, 60, 62, 51, 91, 24, 37, 86, 52, 65, 74, 10, 21, 65, 17, 51, 74, 77, 2, 50, 26, 34, 95, 82, 99, 68, 9, 19, 31, 10, 100, 60, 62, 51, 91, 24, 37, 86, 52, 65, 74, 10, 21, 65, 17, 51, 74, 77, 2, 50, 26, 34, 95, 82]
<class 'numpy.ndarray'> x 2: [198 136  18  38  62  20 200 120 124 102 182  48  74 172 104 130 148  20
  42 130  34 102 148 154   4 100  52  68 190 164]


Multiplying a list by 2 creates a new list of twice the length with the original sequence of list elements repeated.

Multiplying a NumPy array, on the other hand, performs an element-wise calculation in which the array behaves like a vector. We end up with an array of the same size in which each element has been multiplied by 2.
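
To make the contrast concrete, here is a minimal sketch using a small hypothetical list (`sample` is an illustrative name, not part of the notebook's data). It shows that doubling each element of a list requires an explicit loop or comprehension, while NumPy's `*` gives the same result in one step.

```python
import numpy as np

# Hypothetical three-grade sample, used only for illustration
sample = [50, 75, 100]

repeated = sample * 2                    # list repetition: the sequence is repeated
doubled_list = [g * 2 for g in sample]   # explicit element-wise doubling of a list
doubled_array = np.array(sample) * 2     # NumPy applies the multiplication element-wise

print(repeated)
print(doubled_list)
print(doubled_array.tolist())
```

The comprehension and the array multiplication produce the same values; the array version avoids the explicit Python loop.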

In [7]:
# Shape of numpy array
print(grades.shape)

(30,)


In [8]:
# Add a second dimension to the grades data

# Generate random study hours for the same 30 students
study_hours = [random.randint(1, 24) for _ in range(30)]
print(study_hours)


[9, 6, 3, 2, 2, 24, 2, 9, 13, 24, 10, 6, 2, 24, 2, 4, 8, 24, 18, 12, 14, 4, 24, 8, 16, 3, 17, 7, 16, 22]


In [9]:
# Create a 2D array (an array of arrays)
student_data = np.array([study_hours, grades])
student_data

array([[  9,   6,   3,   2,   2,  24,   2,   9,  13,  24,  10,   6,   2,
         24,   2,   4,   8,  24,  18,  12,  14,   4,  24,   8,  16,   3,
         17,   7,  16,  22],
       [ 99,  68,   9,  19,  31,  10, 100,  60,  62,  51,  91,  24,  37,
         86,  52,  65,  74,  10,  21,  65,  17,  51,  74,  77,   2,  50,
         26,  34,  95,  82]])

In [10]:
print(student_data.shape)

(2, 30)


In [11]:
# The full 2D array
print(student_data)
# First row: study hours
print(student_data[0])
# Second row: grades
print(student_data[1])
# First row, first element: study hours of the first student
print(student_data[0][0])
# Second row, first element: grade of the first student
print(student_data[1][0])

[[  9   6   3   2   2  24   2   9  13  24  10   6   2  24   2   4   8  24
   18  12  14   4  24   8  16   3  17   7  16  22]
 [ 99  68   9  19  31  10 100  60  62  51  91  24  37  86  52  65  74  10
   21  65  17  51  74  77   2  50  26  34  95  82]]
[ 9  6  3  2  2 24  2  9 13 24 10  6  2 24  2  4  8 24 18 12 14  4 24  8
 16  3 17  7 16 22]
[ 99  68   9  19  31  10 100  60  62  51  91  24  37  86  52  65  74  10
  21  65  17  51  74  77   2  50  26  34  95  82]
9
99
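
The chained form `student_data[1][0]` can also be written with a single pair of brackets, and the same syntax lets you slice across rows. The sketch below illustrates this on a small hypothetical array (`small` and its values are made up for the example), with row 0 as study hours and row 1 as grades.

```python
import numpy as np

# Hypothetical 2 x 4 array: row 0 = study hours, row 1 = grades
small = np.array([[9, 6, 3, 2],
                  [99, 68, 9, 19]])

first_grade = small[1, 0]         # same element as small[1][0]
first_student = small[:, 0]       # column 0: hours and grade of the first student
first_two_grades = small[1, :2]   # first two entries of the grades row

print(first_grade)
print(first_student)
print(first_two_grades)
```

Selecting a column with `small[:, 0]` is something the chained form cannot express directly, which is one reason the comma form is usually preferred.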


In [12]:
# Compute the mean of each sub-array (study hours and grades)
avg_study = student_data[0].mean()
avg_grade = student_data[1].mean()

print('Average study hours: {:.2f}\nAverage grade: {:.2f}'.format(avg_study, avg_grade))

Average study hours: 11.17
Average grade: 51.40
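
As an alternative to taking the mean of each row separately, `ndarray.mean` accepts an `axis` argument. The sketch below assumes a small hypothetical array (`small`) laid out like `student_data`, with one row per variable, and computes both row means in a single call.

```python
import numpy as np

# Hypothetical 2 x 3 array: row 0 = study hours, row 1 = grades
small = np.array([[2, 4, 6],
                  [50, 60, 70]])

# axis=1 averages across the columns of each row, giving one mean per row
row_means = small.mean(axis=1)

print('Average study hours: {:.2f}\nAverage grade: {:.2f}'.format(row_means[0], row_means[1]))
```

With `axis=0` the same call would instead average down each column, giving one value per student.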


## Exploring tabular data with Pandas

NumPy provides much of the functionality you need to work with numbers, and specifically with arrays of numeric values. When you start to deal with two-dimensional tables of data, however, the **Pandas** package offers a more convenient structure to work with: the **DataFrame**.
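
As a rough illustration of what "more convenient" means here, the sketch below builds a tiny DataFrame from hypothetical values (the names and numbers are made up for the example) and selects data by column label rather than by row position.

```python
import pandas as pd

# Hypothetical three-student table, for illustration only
df_example = pd.DataFrame({
    'Name': ['Ana', 'Ben', 'Cai'],
    'StudyHours': [10, 4, 7],
    'Grade': [80, 55, 68],
})

grade_column = df_example['Grade']                       # select a column by its label
top_row = df_example.loc[df_example['Grade'].idxmax()]   # row with the highest grade

print(df_example.shape)
print(top_row['Name'])
```

Each column keeps its own label and data type, so text such as names can sit alongside numeric columns, which a single numeric NumPy array does not handle as naturally.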

In [17]:
# Create a students DataFrame
import pandas as pd

students_names = ['Jakeem', 'Helena', 'Ismat', 'Anila', 'Skye', 'Daniel', 'Aisha',
                  'Liam', 'Noah', 'Elijah', 'James', 'William', 'Benjamin',
                  'Lucas', 'Mason', 'Oliver', 'Evelyn', 'Abigail', 'Emily', 'Harper',
                  'Amelia', 'Ava', 'Sophia', 'Mia', 'Isabella', 'Charlotte', 'Gianna',
                  'Saif', 'Asif', 'Jawad', 'Fahad']

# The original cell left the 'Name' value empty, which raised a SyntaxError.
# Assumption: the names list is the intended value. It holds 31 names but there
# are only 30 grades, so it is truncated to match the length of the data.
df_students = pd.DataFrame(
    {
        'Name': students_names[:len(grades)],
        'StudyHours': student_data[0],
        'Grade': student_data[1]
    }
)