The purpose of this lab is to warm up with Python for working with CSV (comma-separated values) files, handling dataframes, and analyzing data with NumPy.  Take the time to play around and understand what each cell is doing.

# Basics of Numpy and Pandas

In [None]:
import pandas as pd
import numpy as np

# NumPy Arrays

NumPy arrays are multi-dimensional arrays (for example, matrices) that let you efficiently manipulate data and subarrays, split arrays, reshape arrays, and much more. Think of one as a nested List[List, List, List], but with more functionality.

In [None]:
'''We will generate random numbers with NumPy arrays of various sizes'''
# One-dimensional array of 6 random integers from 0 to 9 (10 is excluded)
x0 = np.random.randint(10, size=6)
# Two-dimensional array (1 row, 6 columns) with integers from 0 to 9
x1 = np.random.randint(10, size=(1, 6))
# Two-dimensional array (3 rows, 4 columns) with integers from 0 to 9
x2 = np.random.randint(10, size=(3, 4))
# Three-dimensional array with integers from 0 to 9
x3 = np.random.randint(10, size=(3, 4, 5))


Each NumPy array has the following attributes:

1. ndim (the number of dimensions)
2. shape (the size of each dimension)
3. size (the total number of elements in the array).






In [None]:
print("x0 is a 1-D array of length 6:\n",x0)
print("x0 ndim: ", x0.ndim)
print("x0 shape:", x0.shape)
print("x0 size: ", x0.size)
print()
print("x1 is a 2-D 1x6 matrix:\n",x1)
print("x1 ndim: ", x1.ndim)
print("x1 shape:", x1.shape)
print("x1 size: ", x1.size)
print()
print("x2 is a 2-D 3x4 matrix:\n",x2)
print("x2 ndim: ", x2.ndim)
print("x2 shape:", x2.shape)
print("x2 size: ", x2.size)
print()
print("x3 is a 3-D 3x4x5 matrix:\n",x3)
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)

x0 is a 1-D array of length 6:
 [5 9 4 8 8 1]
x0 ndim:  1
x0 shape: (6,)
x0 size:  6

x1 is a 2-D 1x6 matrix:
 [[3 9 7 2 5 6]]
x1 ndim:  2
x1 shape: (1, 6)
x1 size:  6

x2 is a 2-D 3x4 matrix:
 [[4 1 3 8]
 [7 5 0 3]
 [4 7 9 0]]
x2 ndim:  2
x2 shape: (3, 4)
x2 size:  12

x3 is a 3-D 3x4x5 matrix:
 [[[5 9 4 8 5]
  [7 9 9 4 8]
  [8 0 2 4 6]
  [3 3 6 4 8]]

 [[4 3 2 7 6]
  [5 8 0 0 3]
  [5 5 1 7 9]
  [9 2 3 8 2]]

 [[8 4 4 5 5]
  [8 3 1 3 2]
  [5 5 5 4 9]
  [6 6 0 1 9]]]
x3 ndim:  3
x3 shape: (3, 4, 5)
x3 size:  60


Accessing elements in a NumPy array is very similar to accessing elements in a list.  

In [None]:
#getting elements from the x0 (1-D) array
print(x0)
print("4th element: ",x0[3])
print("last element: ",x0[-1])
print("second to last element: ",x0[-2])

[5 9 4 8 8 1]
4th element:  8
last element:  1
second to last element:  8


In [None]:
"""In a multi-dimensional array, items can 
be accessed using a comma-separated tuple 
of indices.
For a 2-D array the first value is the row # (n)
you want the value from, the second value
is the column # (m) you want the value from. 
array[n, m]"""
print(x2)
print("1st row, 1st column: ", x2[0,0])
print("last row, last column: ", x2[-1, -1])
print("2nd row, last column: ",x2[1, -1])
print("2nd row, third column: ",x2[1, 2])


[[4 1 3 8]
 [7 5 0 3]
 [4 7 9 0]]
1st row, 1st column:  4
last row, last column:  0
2nd row, last column:  3
2nd row, third column:  0


NumPy arrays have a fixed data type (dtype) that is set when they are created, so individual values can't be changed to a different type. The dtype of an array can be checked with array.dtype.

In [None]:
print("type for x3: ", x3.dtype)
#in this case we can't store float64 values in individual elements of x3
#(an assigned float is truncated to an integer) unless we convert the entire array

type for x3:  int64
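
The fixed dtype can be seen by assigning a float to an integer array. This is a minimal sketch; the arrays `a` and `b` are hypothetical stand-ins for x3, and `astype` is one common way to convert a whole array.

```python
import numpy as np

# Hypothetical small integer array, standing in for x3 above
a = np.array([1, 2, 3])
print(a.dtype)        # an integer dtype (int64 on most platforms)

# Assigning a float to one element keeps the array's dtype:
# the fractional part is dropped
a[0] = 7.9
print(a)              # [7 2 3]

# To store floats, convert the whole array with astype (returns a new array)
b = a.astype(np.float64)
b[0] = 7.9
print(b.dtype)        # float64
print(b[0])           # 7.9
```

Note that `a` itself is unchanged by `astype`; only the new array `b` holds floats.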


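Array Slicing: Accessing Subarrays
Slices use the form array[start:stop:step].
Any value that is left out falls back to its default: start=0, stop=size of the dimension, step=1. With a negative step the defaults for start and stop are swapped, so array[::-1] walks the array backwards.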


In [None]:
#Example with 1-D arrays
print(x0)
#array[inclusive:exclusive]
#all elements before index 3, exclusive of value at index 3
print(x0[:3]) 
#all elements after index 3, inclusive of value at index 3
print(x0[3:])
#the 2nd element to the 2nd last element
print(x0[1:-1])
#or
print(x0[1:5])
#if the stop value > len(array), it is equivalent to array[x:]
print(x0[1:10]) #len of x0 is 6
#accessing every 3rd element in the array
print(x0[::3])
#accessing every other element starting from the 4th element (value at index 3)
print(x0[3::2])
#reversing the elements
print(x0[::-1])
#reversing from the 2nd to last element
print(x0[-2::-1])
#or
print(x0[4::-1])






[5 9 4 8 8 1]
[5 9 4]
[8 8 1]
[9 4 8 8]
[9 4 8 8]
[9 4 8 8 1]
[5 8]
[8 1]
[1 8 8 4 9 5]
[8 8 4 9 5]
[8 8 4 9 5]


In [None]:
#Example with multi-dimensional arrays
print(x2)
#print 1st 2 rows and 1st 3 columns
print(x2[:2, :3])
#print all rows and every other column
print(x2[:, ::2])
#reversing both the rows and the columns
print(x2[::-1, ::-1])
#getting the second column of x2
print(x2[:, 1])
#getting the last row of x2
print(x2[2, :])
#or
print(x2[2])

[[4 1 3 8]
 [7 5 0 3]
 [4 7 9 0]]
[[4 1 3]
 [7 5 0]]
[[4 3]
 [7 0]
 [4 9]]
[[0 9 7 4]
 [3 0 5 7]
 [8 3 1 4]]
[1 5 7]
[4 7 9 0]
[4 7 9 0]


When you need an independent copy of a subarray, call array.copy() explicitly. Otherwise slicing returns a view of the original array, and any change made through the slice is also made to the original.

In [None]:
#without array.copy()
print(x2)
x2_sub = x2[:2, :2]
print(x2_sub)
x2_sub[0, 0] = 99
print(x2_sub)
print(x2)

[[4 1 3 8]
 [7 5 0 3]
 [4 7 9 0]]
[[4 1]
 [7 5]]
[[99  1]
 [ 7  5]]
[[99  1  3  8]
 [ 7  5  0  3]
 [ 4  7  9  0]]


In [None]:
#with array.copy()
print(x2)
x2_sub = x2[:2, :2].copy()
print(x2_sub)
x2_sub[0, 0] = 0
print(x2_sub)
print(x2)

[[99  1  3  8]
 [ 7  5  0  3]
 [ 4  7  9  0]]
[[99  1]
 [ 7  5]]
[[0 1]
 [7 5]]
[[99  1  3  8]
 [ 7  5  0  3]
 [ 4  7  9  0]]
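
One way to check whether two arrays share data is `np.shares_memory`. The sketch below uses a hypothetical array `g` in place of x2 and contrasts a slice with an explicit copy.

```python
import numpy as np

# Hypothetical 3x4 grid, standing in for x2 above
g = np.arange(12).reshape(3, 4)

view = g[:2, :2]          # basic slice: a view onto g's data
copy = g[:2, :2].copy()   # independent copy

print(np.shares_memory(g, view))   # True
print(np.shares_memory(g, copy))   # False

view[0, 0] = 99
print(g[0, 0])      # 99, the change shows up in g
copy[0, 1] = -1
print(g[0, 1])      # 1, g is untouched
```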


Reshaping Arrays: array.reshape(n, m), where n and m are the sizes of the new dimensions. NOTE: for this to work, the size of the initial array must match the size of the reshaped array (n * m). reshape does not change the shape of the original array; it returns a new array object, so assign the result to a variable if you want to use it.

In [None]:
#arranging the numbers 1 to 9 in a 3 by 3 matrix
grid = np.arange(1, 10).reshape((3, 3))
print(grid)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [None]:
#reshaping a 1D array to a 2D array
x = np.array([1, 2, 3])
print(x)
# row vector via reshape
reshapedx = x.reshape((1, 3))
print(reshapedx)
# column vector via reshape
reX = x.reshape((3, 1))
print(reX)

[1 2 3]
[[1 2 3]]
[[1]
 [2]
 [3]]
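
The size rule for reshape can be tried directly. This is a small sketch on a hypothetical array `a`; passing -1 for one dimension asks NumPy to infer it from the array's size.

```python
import numpy as np

a = np.arange(6)             # [0 1 2 3 4 5], size 6

b = a.reshape(2, 3)          # fine: 2 * 3 == 6
print(a.shape, b.shape)      # (6,) (2, 3): a keeps its shape

# -1 asks NumPy to infer that dimension from the array's size
c = a.reshape(3, -1)
print(c.shape)               # (3, 2)

# A size mismatch raises ValueError
try:
    a.reshape(4, 2)          # 4 * 2 != 6
except ValueError as err:
    print("reshape failed:", err)
```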


Use np.concatenate, np.vstack, and np.hstack to combine NumPy arrays. We start with np.concatenate.

In [None]:
#1-D example
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
print(x, y)
comb = np.concatenate([x, y])
print(comb)
#can have more than 2 arrays as well
z = [99, 99, 99]
print(np.concatenate([x, y, z]))

[1 2 3] [3 2 1]
[1 2 3 3 2 1]
[ 1  2  3  3  2  1 99 99 99]


In [None]:
#2-D example
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
combogrid = np.concatenate([grid, grid])
print(combogrid)

[[1 2 3]
 [4 5 6]
 [1 2 3]
 [4 5 6]]


np.vstack (vertical stack) and np.hstack (horizontal stack) offer more control over how the arrays are combined.

In [None]:
#Mixed dimensions example 
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])
# vertically stack the arrays
vxgrid = np.vstack([x, grid])
print(vxgrid)
# we can't horizontally combine x and grid because the shapes don't
# line up: x = (3,) and grid = (2, 3)

# horizontally stack grid and y
y = np.array([[99],
              [99]])
#this can be combined horizontally: grid = (2, 3) and y = (2, 1)
hygrid = np.hstack([grid, y])
print(hygrid)

#np.dstack stacks arrays along the 3rd axis

[[1 2 3]
 [9 8 7]
 [6 5 4]]
[[ 9  8  7 99]
 [ 6  5  4 99]]
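
For 2-D arrays, the stacking helpers behave like np.concatenate along a chosen axis. The sketch below checks that equivalence and shows what happens when the shapes do not line up; the variable names are only for illustration.

```python
import numpy as np

x = np.array([1, 2, 3])            # shape (3,)
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])       # shape (2, 3)

# vstack is concatenation along axis 0 (rows)
v = np.vstack([grid, grid])
print(np.array_equal(v, np.concatenate([grid, grid], axis=0)))  # True

# hstack of 2-D arrays is concatenation along axis 1 (columns)
h = np.hstack([grid, grid])
print(np.array_equal(h, np.concatenate([grid, grid], axis=1)))  # True
print(h.shape)                     # (2, 6)

# Mismatched shapes raise ValueError
try:
    np.hstack([x, grid])
except ValueError as err:
    print("hstack failed:", err)
```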



# Pandas


`pandas` is a great way to manipulate dataframes loaded from existing data (such as CSV files). Dataframes can also be created from lists and dictionaries.




In [None]:
#example dictionary of names and grades
data = {'Name':['kartik', 'fahed', 'zainub', 'kyle', 'spencer', 'kaleen', 'andrew', 'lise'],
        'Grade': [79, 99, 89, 80, 87, 90, 81, None],
        'Hired': ['y', 'n', 'y', 'y', 'n', 'n', 'y', 'n']}
print(data)
#creating a dataframe
df = pd.DataFrame(data)
df

{'Name': ['kartik', 'fahed', 'zainub', 'kyle', 'spencer', 'kaleen', 'andrew', 'lise'], 'Grade': [79, 99, 89, 80, 87, 90, 81, None], 'Hired': ['y', 'n', 'y', 'y', 'n', 'n', 'y', 'n']}


Unnamed: 0,Name,Grade,Hired
0,kartik,79.0,y
1,fahed,99.0,n
2,zainub,89.0,y
3,kyle,80.0,y
4,spencer,87.0,n
5,kaleen,90.0,n
6,andrew,81.0,y
7,lise,,n


In [None]:
#looking at the first rows of the dataframe (default value 5)
df.head()

Unnamed: 0,Name,Grade,Hired
0,kartik,79.0,y
1,fahed,99.0,n
2,zainub,89.0,y
3,kyle,80.0,y
4,spencer,87.0,n


In [None]:
#looking for missing values
#returns a table with True where a value is missing (NaN) and False elsewhere
df.isnull()

Unnamed: 0,Name,Grade,Hired
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
5,False,False,False
6,False,False,False
7,False,True,False


In [None]:
#finding the number of missing values in each column
df.isnull().sum()

Name     0
Grade    1
Hired    0
dtype: int64

In [None]:
#getting the number of elements in a pandas dataframe
print(df.size)
#getting the shape (rows, columns) of a pandas dataframe
print(df.shape)

24
(8, 3)


In [None]:
# Selecting columns
df[['Name']]

Unnamed: 0,Name
0,kartik
1,fahed
2,zainub
3,kyle
4,spencer
5,kaleen
6,andrew
7,lise


In [None]:
#getting column names
df.columns

Index(['Name', 'Grade', 'Hired'], dtype='object')

Selecting rows with .loc
1.   Selecting rows by label/index
2.   Selecting rows with a boolean / conditional lookup


In [None]:
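#finding specific values by setting the index of a df to that column
#let's look rows up by their Grade (for example, the score of 99)
#keep an untouched copy first, because set_index(inplace=True) modifies df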
copyDf = df.copy()
df.set_index("Grade", inplace=True)
df.head()

Grade,Name,Hired
79.0,kartik,y
99.0,fahed,n
89.0,zainub,y
80.0,kyle,y
87.0,spencer,n


In [None]:
#selecting rows with .loc
row = df.loc[87]
row

Name     spencer
Hired          n
Name: 87.0, dtype: object

In [None]:
df.loc[99]
#fahed is the only person to score a 99

Name     fahed
Hired        n
Name: 99.0, dtype: object

Changing the index can be a quick fix, but it isn't very practical: every time we want to search on a different column, we would have to reset the index to that column. Instead, we can filter rows with a boolean condition, as follows.

In [None]:
#our original df is now copyDf
copyDf.loc[copyDf['Grade'] == 99]

Unnamed: 0,Name,Grade,Hired
1,fahed,99.0,n


In [None]:
#combining multiple conditions with & (each condition needs its own parentheses)
copyDf.loc[(copyDf['Grade'] > 85) & (copyDf['Hired'] == 'n')]

Unnamed: 0,Name,Grade,Hired
1,fahed,99.0,n
4,spencer,87.0,n
5,kaleen,90.0,n


In [None]:
#searching for multiple values
copyDf.loc[copyDf['Grade'].isin([80, 81, 82, 84])]

Unnamed: 0,Name,Grade,Hired
3,kyle,80.0,y
6,andrew,81.0,y


Note that most of the values returned above are still pandas objects (DataFrames or Series) rather than plain Python or NumPy values, so you may need to convert them (for example with `.to_numpy()` or `.tolist()`) before passing them to code that expects plain arrays or lists. Additionally, when we get to making models, they only accept numerical inputs, so categorical data must be converted to a numerical representation. There are various ways to do this, so refer to this [link](https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02) for more insights.
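
As one simple illustration of turning categories into numbers, a y/n column can be mapped to 1/0. This is only a sketch: the Series `hired` is hypothetical, and the 1/0 coding is an assumption, not the only reasonable encoding.

```python
import pandas as pd

# Hypothetical column of y/n answers, like 'Hired' above
hired = pd.Series(['y', 'n', 'y', 'n'])

# Map each category to a number (this 1/0 coding is an assumption)
hired_num = hired.map({'y': 1, 'n': 0})
print(hired_num.tolist())      # [1, 0, 1, 0]

# to_numpy() hands back a plain NumPy array
arr = hired_num.to_numpy()
print(type(arr).__name__)      # ndarray
print(arr.sum())               # 2
```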

In [None]:
#finding how many people got hired (the result would be the same on copyDf)
#value_counts() is indexed by the category labels, so look counts up by 'y' and 'n'
hiredStud = df['Hired'].value_counts()['y']
nonhiredStud = df['Hired'].value_counts()['n']
print("hired students: ",hiredStud, "\nnon-hired students: ", nonhiredStud)

hired students:  4 
non-hired students:  4


Selecting rows with .iloc


1.   Selecting rows by integer position, similar to NumPy slicing



In [None]:
#selecting rows with .iloc
row = df.iloc[-1]
print(row)
#selecting first 5 rows of data frame with all cols
print()
temp = df.iloc[:5, :]
print(temp)


Name     lise
Hired       n
Name: nan, dtype: object

          Name Hired
Grade               
79.0    kartik     y
99.0     fahed     n
89.0    zainub     y
80.0      kyle     y
87.0   spencer     n


Adding/Dropping Columns

In [None]:
#adding a column of equivalent letter grades to copyDf
# A = 90-100, B = 80-89, C = 70-79, anything else (including NaN) = Fail
letterGrades = []
for val in copyDf["Grade"]:
    if 90 <= val <= 100:
        letterGrades.append("A")
    elif 80 <= val < 90:
        letterGrades.append("B")
    elif 70 <= val < 80:
        letterGrades.append("C")
    else:
        # NaN fails every comparison above, so missing grades end up here too
        letterGrades.append("Fail")
copyDf['LG'] = letterGrades
copyDf




Unnamed: 0,Name,Grade,Hired,LG
0,kartik,79.0,y,C
1,fahed,99.0,n,A
2,zainub,89.0,y,B
3,kyle,80.0,y,B
4,spencer,87.0,n,B
5,kaleen,90.0,n,A
6,andrew,81.0,y,B
7,lise,,n,Fail
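
The same kind of mapping can also be written as a small function and applied to a column with `Series.apply`. The sketch below is an illustration only: `letter_grade` is a hypothetical helper, and its cutoffs (A for 90 and up, B for 80 and up, C for 70 and up, Fail for everything else including NaN) are assumptions.

```python
import numpy as np
import pandas as pd

def letter_grade(score):
    """Hypothetical helper: map a numeric score to a letter grade."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    return "Fail"    # NaN and anything below 70 land here

grades = pd.Series([79, 99, 89.5, np.nan, 42])
letters = grades.apply(letter_grade)
print(letters.tolist())
# ['C', 'A', 'B', 'Fail', 'Fail']
```

Because every score returns exactly one label, the result always has the same length as the input column.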


In [None]:
#now let's drop the Grade column, since the LG column carries the same information
copyDf.drop(['Grade'], axis = 1, inplace = True)
copyDf

Unnamed: 0,Name,Hired,LG
0,kartik,y,C
1,fahed,n,A
2,zainub,y,B
3,kyle,y,B
4,spencer,n,B
5,kaleen,n,A
6,andrew,y,B
7,lise,n,Fail


# Exercise with real data



* Import [**CAvideos.csv**](https://www.kaggle.com/datasnaek/youtube-new)
  * If you are on Google Colaboratory, import the file using the left sidebar (under the colab logo).  
> Files -> Upload to session storage

  * If you are on another notebook program, just make sure the file is in the same directory as this notebook.












Dataframe handling

First we use pandas to load our CSV file into a DataFrame object.  
For more info, click here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html


In [None]:
# converts the CSV file to a dataframe object
dataframe = pd.read_csv('CAvideos.csv') 

dataframe.head() # shows the first 5 rows of the df


Here, we do some data cleaning.  We remove the first column "video_id", and store the new dataframe.

In [None]:
trimmed_dataframe = dataframe.drop(columns=['video_id'])
trimmed_dataframe.head()


# Be careful if you reassign the same object. Example:
# dataframe = dataframe.drop(columns=['video_id'])
# With the line above, the cell can only be run once: it is not idempotent,
# because a second run raises a KeyError (the column is already gone).
# dataframe.drop(columns=['video_id'], inplace=True) modifies the dataframe
# in place instead of returning a new one, but it can also only be run once.
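
The re-run problem described in the comments can be reproduced on a tiny table. This is a sketch with a hypothetical two-column `frame`; passing `errors='ignore'` to `drop` is one way to make the line safe to run more than once.

```python
import pandas as pd

# Hypothetical two-column frame, standing in for the CSV data
frame = pd.DataFrame({'video_id': ['a1', 'b2'], 'views': [10, 20]})

frame = frame.drop(columns=['video_id'])
print(frame.columns.tolist())      # ['views']

# Dropping the same column again fails, since it is already gone
try:
    frame = frame.drop(columns=['video_id'])
except KeyError as err:
    print("second drop failed:", err)

# errors='ignore' skips missing labels, so the line can be re-run safely
frame = frame.drop(columns=['video_id'], errors='ignore')
print(frame.columns.tolist())      # ['views']
```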

The following prints out the column names (some of these are useful features)!  

In [None]:
trimmed_dataframe.columns.values.tolist()

QUESTION:

Check for missing values, as they will prevent us from using the data in our model or in any other calculations we may perform.

1.   Find any missing values if they exist (2.5 points)
2.   Drop the rows with missing values, if any exist, without creating a new dataset (2.5 points)

HINT: https://www.journaldev.com/33492/pandas-dropna-drop-null-na-values-from-dataframe







In [None]:
#1. finding missing values


In [None]:
#2. removing missing values


We can use the column names to pull out certain columns.  
For example, here are the video titles.

In [None]:
trimmed_dataframe["title"].tolist()
# this line does exactly the same thing
# list(trimmed_dataframe['title'])

# **Very Useful Tip**: 
Use `type(obj)` to learn about how the dataframe works to help prevent type mismatches.   

Notice that *dataframe.values* is a numpy array. 




In [None]:
type(trimmed_dataframe)

In [None]:
type(trimmed_dataframe.values)

In [None]:
type(trimmed_dataframe.columns)

In [None]:
type(trimmed_dataframe.columns.values.tolist())

We can extract the numerical and categorical values from a DataFrame into a Numpy array. 


In [None]:
# slices everything into a NumPy data array
data_array = trimmed_dataframe.values[:, :]

QUESTION: Show the key three attributes of the numpy array above (3 points)


In [None]:
#Answer:


In [None]:
#Answer: 


In [None]:
#Answer:


QUESTION: Show the first 5 rows and the last column (3 points)

In [None]:
# Answer:


QUESTION: Sort the df by the most disliked videos. (3 points)

Use .head() for your answer



In [None]:
# Answer:


QUESTION: How many videos had > 50,000,000 views? (3 points)

In [None]:
# Answer: 


QUESTION: Calculate the average views per video.  

1.   Using np.average (2.5 points)
2.   Using df['column'].mean() (2.5 points)







In [None]:
# Answer:


In [None]:
# Answer:


QUESTION: How many videos are by "Ed Sheeran"? (Column name is "channel_title") (3 points)

In [None]:
# Answer:


QUESTION: Find the number of video titles that include Beyonce (3 points)


In [None]:
# Answer:


QUESTION: Find the number of videos that disabled comments (3 points)

In [None]:
# Answer:


EXTRA CREDIT: Find the video with the best like-to-dislike ratio (5 points)

In [None]:
# Answer:
