The purpose of this lab is to warm up with Python: working with CSV (comma-separated values) files, handling dataframes, and analyzing data with NumPy.  Please take the time to play around and understand what everything is doing.

# Basics of Numpy and Pandas

In [20]:
import pandas as pd
import numpy as np

# NumPy Arrays

NumPy arrays are multi-dimensional arrays (a 2-D array is a matrix) that let you efficiently manipulate data and subarrays, split arrays, reshape arrays, and much more. Think of one as a nested List[List, List, List], but with more functionality.
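
To make the comparison with nested lists concrete, here is a minimal sketch. The names `nested` and `arr` and their values are illustrative only and are not used elsewhere in the lab.

```python
import numpy as np

# A plain nested list and the equivalent NumPy array (illustrative values)
nested = [[1, 2, 3], [4, 5, 6]]
arr = np.array(nested)

# Arithmetic on the array is applied element by element
doubled = arr * 2
print(doubled)

# The same operator on a list repeats the list instead
repeated = nested * 2
print(repeated)

# Column sums need no explicit loop
col_sums = arr.sum(axis=0)
print(col_sums)
```

Element-wise arithmetic and axis-based reductions are the main things the array gives you that the nested list does not.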

In [21]:
'''We will generate random integers in NumPy arrays of various shapes'''
# One-dimensional array (6 elements) with integers from 0 to 9
x0 = np.random.randint(10, size=6)
# Two-dimensional array (1 row, 6 columns) with integers from 0 to 9
x1 = np.random.randint(10, size=(1,6))
# Two-dimensional array (3 rows, 4 columns) with integers from 0 to 9
x2 = np.random.randint(10, size=(3, 4))
# Three-dimensional array with integers from 0 to 9
x3 = np.random.randint(10, size=(3, 4, 5))


Each Numpy array has the following attributes: 

1. ndim (the number of dimensions)
2. shape (the size of each dimension)
3. size (the total size of the array).






In [123]:
print("x0 is a 1-D 1x6 matrix:\n",x0)
print("x0 ndim: ", x0.ndim)
print("x0 shape:", x0.shape)
print("x0 size: ", x0.size)
print()
print("x1 is a 2-D 1x6 matrix:\n",x1)
print("x1 ndim: ", x1.ndim)
print("x1 shape:", x1.shape)
print("x1 size: ", x1.size)
print()
print("x2 is a 2-D 3x4 matrix:\n",x2)
print("x2 ndim: ", x2.ndim)
print("x2 shape:", x2.shape)
print("x2 size: ", x2.size)
print()
print("x3 is a 3-D 3x4x5 matrix:\n",x3)
print("x3 ndim: ", x3.ndim)
print("x3 shape:", x3.shape)
print("x3 size: ", x3.size)

x0 is a 1-D 1x6 matrix:
 [3 0 2 0 8 9]
x0 ndim:  1
x0 shape: (6,)
x0 size:  6

x1 is a 2-D 1x6 matrix:
 [[5 2 1 3 9 2]]
x1 ndim:  2
x1 shape: (1, 6)
x1 size:  6

x2 is a 2-D 3x4 matrix:
 [[99  1  0  5]
 [ 9  1  7  6]
 [ 0  9  1  9]]
x2 ndim:  2
x2 shape: (3, 4)
x2 size:  12

x3 is a 3-D 3x4x5 matrix:
 [[[1 4 6 7 3]
  [0 5 3 3 2]
  [3 6 2 7 2]
  [9 4 7 9 1]]

 [[2 5 3 6 1]
  [1 8 0 6 5]
  [9 4 0 6 3]
  [2 6 4 6 5]]

 [[4 1 6 5 6]
  [8 0 1 7 9]
  [1 1 2 0 5]
  [2 5 2 9 8]]]
x3 ndim:  3
x3 shape: (3, 4, 5)
x3 size:  60


Accessing elements in a NumPy array is very similar to accessing elements in a list.  

In [23]:
#getting elements from the x0 (1-D) array
print(x0)
print("4th element: ",x0[3])
print("last element: ",x0[-1])
print("second to last element: ",x0[-2])

[3 0 2 0 8 9]
4th element:  0
last element:  9
second to last element:  8


In [24]:
"""In a multi-dimensional array, items can 
be accessed using a comma-separated tuple 
of indices.
For a 2-D array the first value is the row # (n)
you want the value from, the second value
is the column # (m) you want the value from. 
array[n, m]"""
print(x2)
print("1st row, 1st column: ", x2[0,0])
print("last row, last column: ", x2[-1, -1])
print("2nd row, last column: ",x2[1, -1])
print("3rd row, 4th column: ",x2[2, 3])


[[2 1 0 5]
 [9 1 7 6]
 [0 9 1 9]]
1st row, 1st column:  2
last row, last column:  9
2nd row, last column:  6
3rd row, 4th column:  9


NumPy arrays have a fixed number type (dtype) that is set when they are created, so individual values can't be changed to a different number type. The type of the values in a NumPy array can be checked with array.dtype.

In [25]:
print("type for x3: ", x3.dtype)
#in this case we can't change individual values in x3 to float64 types,
#unless we convert the entire array

type for x3:  int32
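
As a rough illustration of the fixed dtype, the sketch below assigns a float into an integer array and then converts the whole array. The arrays `a` and `b` are made up for this example.

```python
import numpy as np

a = np.array([1, 2, 3])        # integer dtype inferred from the values
a[0] = 3.7                     # the float is truncated to fit the integer dtype
print(a)                       # first element is 3, not 3.7

b = a.astype(np.float64)       # astype returns a new array with the requested dtype
b[0] = 3.7                     # now the fractional part is kept
print(b, b.dtype)
```

Note that the truncation happens silently, which is why checking `dtype` before assigning values is worthwhile.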


Array Slicing: Accessing Subarrays

array[start:stop:step].
Any value that is left out takes its default: start=0, stop=size of the dimension, step=1. With a negative step, slicing starts from the end of the dimension instead.


In [26]:
#Example with 1-D arrays
print(x0)
#array[inclusive:exclusive]
#all elements before index 3, exclusive of value at index 3
print(x0[:3]) 
#all elements after index 3, inclusive of value at index 3
print(x0[3:])
#the 2nd element to the 2nd last element
print(x0[1:-1])
#or
print(x0[1:5])
#if the exclusive value > len(array) then it is equivalent to array[x:]
print(x0[1:10]) #len of x0 is 6
#accessing every 3rd element in the array
print(x0[::3])
#accessing every other element starting from the 4th element (value at index 3)
print(x0[3::2])
#reversing the elements
print(x0[::-1])
#reversing from the 2nd to last element
print(x0[-2::-1])
#or
print(x0[4::-1])






[3 0 2 0 8 9]
[3 0 2]
[0 8 9]
[0 2 0 8]
[0 2 0 8]
[0 2 0 8 9]
[3 0]
[0 9]
[9 8 0 2 0 3]
[8 0 2 0 3]
[8 0 2 0 3]


In [27]:
#Example with multi-dimensional arrays
print(x2)
#print 1st 2 rows and 1st 3 columns
print(x2[:2, :3])
#print all rows and every other column
print(x2[:, ::2])
#reversing the array
print(x2[::-1, ::-1])
#getting the second column of x2
print(x2[:, 1])
#getting the last row of x2
print(x2[2, :])
#or
print(x2[2])

[[2 1 0 5]
 [9 1 7 6]
 [0 9 1 9]]
[[2 1 0]
 [9 1 7]]
[[2 0]
 [9 7]
 [0 1]]
[[9 1 9 0]
 [6 7 1 9]
 [5 0 1 2]]
[1 1 9]
[0 9 1 9]
[0 9 1 9]


When you need an independent copy of a subarray, call array.copy() explicitly. Otherwise slicing returns a view of the original array, so any change made through the slice also shows up in the original.

In [28]:
#without array.copy()
print(x2)
x2_sub = x2[:2, :2]
print(x2_sub)
x2_sub[0, 0] = 99
print(x2_sub)
print(x2)

[[2 1 0 5]
 [9 1 7 6]
 [0 9 1 9]]
[[2 1]
 [9 1]]
[[99  1]
 [ 9  1]]
[[99  1  0  5]
 [ 9  1  7  6]
 [ 0  9  1  9]]


In [29]:
#with array.copy()
print(x2)
x2_sub = x2[:2, :2].copy()
print(x2_sub)
x2_sub[0, 0] = 0
print(x2_sub)
print(x2)

[[99  1  0  5]
 [ 9  1  7  6]
 [ 0  9  1  9]]
[[99  1]
 [ 9  1]]
[[0 1]
 [9 1]]
[[99  1  0  5]
 [ 9  1  7  6]
 [ 0  9  1  9]]
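
One way to check whether two arrays share data is `np.shares_memory`. The following is a small sketch on a made-up array `base`, separate from the `x2` cells above.

```python
import numpy as np

base = np.arange(12).reshape(3, 4)
view = base[:2, :2]            # basic slicing returns a view
dup = base[:2, :2].copy()      # copy() allocates independent storage

print(np.shares_memory(base, view))  # True
print(np.shares_memory(base, dup))   # False

view[0, 0] = -1   # writes through to base
dup[0, 1] = -2    # leaves base untouched
print(base[0, 0], base[0, 1])
```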


Reshaping Arrays: array.reshape(n, m), where n and m are the sizes of the new dimensions. NOTE: for this to work, the size of the initial array must match the size of the reshaped array. Reshaping does not change the shape of the original array; reshape returns the reshaped array, so assign the result to a new variable if you want to use it.

In [30]:
#arranging the numbers 1 to 9 in a 3 by 3 matrix
grid = np.arange(1, 10).reshape((3, 3))
print(grid)

[[1 2 3]
 [4 5 6]
 [7 8 9]]


In [31]:
#reshaping a 1D array to a 2D array
x = np.array([1, 2, 3])
print(x)
# row vector via reshape
reshapedx = x.reshape((1, 3))
print(reshapedx)
reX = x.reshape((3, 1))
print(reX)

[1 2 3]
[[1 2 3]]
[[1]
 [2]
 [3]]
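
A few reshape details are easier to see in code. This sketch uses a throwaway array `v`; it shows the `-1` placeholder, that the original keeps its shape, that the reshaped result of a contiguous array is a view, and what happens when the sizes do not match.

```python
import numpy as np

v = np.arange(6)
m = v.reshape(2, -1)     # -1 lets NumPy infer the missing dimension (3 here)
print(m.shape)

print(v.shape)           # v keeps its own shape

m[0, 0] = 100            # for a contiguous array like v, reshape returns a view
print(v[0])

try:
    v.reshape(4, 2)      # 6 elements cannot fill a 4x2 array
except ValueError as err:
    print("reshape failed:", err)
```

If the reshaped array must be independent of the original, adding `.copy()` after the reshape is one option.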


Use np.concatenate, np.vstack, and np.hstack to join NumPy arrays. np.concatenate joins arrays along an existing axis (the first axis by default).

In [32]:
#1-D example
x = np.array([1, 2, 3])
y = np.array([3, 2, 1])
print(x, y)
comb = np.concatenate([x, y])
print(comb)
#can have more than 2 arrays as well
z = [99, 99, 99]
print(np.concatenate([x, y, z]))

[1 2 3] [3 2 1]
[1 2 3 3 2 1]
[ 1  2  3  3  2  1 99 99 99]


In [33]:
#2-D example
grid = np.array([[1, 2, 3],
                 [4, 5, 6]])
combogrid = np.concatenate([grid, grid])
print(combogrid)

[[1 2 3]
 [4 5 6]
 [1 2 3]
 [4 5 6]]


np.vstack (vertical stack) and np.hstack (horizontal stack) offer more control over how the arrays are combined.

In [34]:
#Mixed dimensions example 
x = np.array([1, 2, 3])
grid = np.array([[9, 8, 7],
                 [6, 5, 4]])
# vertically stack the arrays
vxgrid = np.vstack([x, grid])
print(vxgrid)
# we can't horizontally stack x and grid because the shapes don't
# line up: x = (3,) and grid = (2, 3)

# horizontally stack grid and y
y = np.array([[99],
              [99]])
#this can be combined horizontally: grid = (2,3) and y = (2,1)
hygrid = np.hstack([grid, y])
print(hygrid)

#np.dstack stacks arrays along the 3rd axis

[[1 2 3]
 [9 8 7]
 [6 5 4]]
[[ 9  8  7 99]
 [ 6  5  4 99]]
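
For 2-D inputs, the same results can be obtained from `np.concatenate` by choosing the axis. The sketch below reuses the shapes from the cell above with illustrative names `grid` and `col`.

```python
import numpy as np

grid = np.array([[9, 8, 7],
                 [6, 5, 4]])
col = np.array([[99],
                [99]])

# axis=0 joins along rows, matching np.vstack for 2-D inputs
rows = np.concatenate([grid, grid], axis=0)
# axis=1 joins along columns, matching np.hstack for 2-D inputs
cols = np.concatenate([grid, col], axis=1)

print(rows.shape, cols.shape)
print(np.array_equal(cols, np.hstack([grid, col])))
```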


The dot product of two arrays is computed with np.dot (NumPy has no np.dotprod function).
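
NumPy has no function called `np.dotprod`; the dot product is provided by `np.dot`. A minimal sketch with made-up vectors and matrices:

```python
import numpy as np

u = np.array([1, 2, 3])
w = np.array([4, 5, 6])
print(np.dot(u, w))   # 1*4 + 2*5 + 3*6 = 32

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])
print(np.dot(A, B))   # matrix product of two 2-D arrays
print(A @ B)          # the @ operator gives the same result for 2-D arrays
```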

# Pandas


`pandas` is a great way to manipulate dataframes loaded from existing data (such as CSV files). Dataframes can also be created from lists and dictionaries.




In [35]:
#example dictionary of names and grades
data = {'Name':['kartik', 'fahed', 'zainub', 'kyle', 'spencer', 'kaleen', 'andrew', 'lise'],
        'Grade': [79, 99, 89, 80, 87, 90, 81, None],
        'Hired': ['y', 'n', 'y', 'y', 'n', 'n', 'y', 'n']}
print(data)
#creating a dataframe
df = pd.DataFrame(data)
df

{'Name': ['kartik', 'fahed', 'zainub', 'kyle', 'spencer', 'kaleen', 'andrew', 'lise'], 'Grade': [79, 99, 89, 80, 87, 90, 81, None], 'Hired': ['y', 'n', 'y', 'y', 'n', 'n', 'y', 'n']}


Unnamed: 0,Name,Grade,Hired
0,kartik,79.0,y
1,fahed,99.0,n
2,zainub,89.0,y
3,kyle,80.0,y
4,spencer,87.0,n
5,kaleen,90.0,n
6,andrew,81.0,y
7,lise,,n


In [36]:
#looking at the first rows of the dataframe (default value 5)
df.head()

Unnamed: 0,Name,Grade,Hired
0,kartik,79.0,y
1,fahed,99.0,n
2,zainub,89.0,y
3,kyle,80.0,y
4,spencer,87.0,n


In [37]:
#looking for missing values
#returns a table with trues and falses for NaNs
df.isnull()

Unnamed: 0,Name,Grade,Hired
0,False,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
5,False,False,False
6,False,False,False
7,False,True,False


In [38]:
#finding the total number of missing values
df.isnull().sum()

Name     0
Grade    1
Hired    0
dtype: int64
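
Once missing values are located, two common follow-ups are dropping the affected rows or filling them in. This is a sketch on a small made-up frame `toy`; filling with the column mean is just one possible choice, not a recommendation from the lab.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Name": ["a", "b", "c"],
                    "Grade": [90.0, np.nan, 70.0]})

# New frame without the rows that contain NaN
dropped = toy.dropna()
# New frame with NaN in Grade replaced by the column mean (80.0 here)
filled = toy.fillna({"Grade": toy["Grade"].mean()})

print(dropped)
print(filled)
```

Both calls return new frames and leave `toy` unchanged.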

In [39]:
#getting the number of elements in a pandas dataframe
print(df.size)
#getting the dimensionality of a pandas dataframe
print(df.shape)

24
(8, 3)


In [40]:
# Selecting columns
df[['Name']]

Unnamed: 0,Name
0,kartik
1,fahed
2,zainub
3,kyle
4,spencer
5,kaleen
6,andrew
7,lise


In [41]:
#getting column names
df.columns

Index(['Name', 'Grade', 'Hired'], dtype='object')

Selecting rows with .loc
1.   Selecting rows by label/index
2.   Selecting rows with a boolean / conditional lookup


In [42]:
#finding specific values by setting the index of a df to that column
#lets find the score of 99
copyDf = df.copy()
df.set_index("Grade", inplace=True)
df.head()

Unnamed: 0_level_0,Name,Hired
Grade,Unnamed: 1_level_1,Unnamed: 2_level_1
79.0,kartik,y
99.0,fahed,n
89.0,zainub,y
80.0,kyle,y
87.0,spencer,n


In [43]:
#selecting rows with .loc
row = df.loc[87]
row

Name     spencer
Hired          n
Name: 87.0, dtype: object

In [44]:
df.loc[99]
#faheds the only person to score a 99

Name     fahed
Hired        n
Name: 99.0, dtype: object

Changing the index can be a quick fix, but it isn't very practical: every time we want to search on a different column, we would have to set the index to that column. Alternatively, we can do the following.

In [45]:
#our original df is now copyDf
copyDf.loc[copyDf['Grade'] == 99]

Unnamed: 0,Name,Grade,Hired
1,fahed,99.0,n


In [46]:
#searching for multiple values
copyDf.loc[(copyDf['Grade'] > 85) & (copyDf['Hired'] == 'n')]

Unnamed: 0,Name,Grade,Hired
1,fahed,99.0,n
4,spencer,87.0,n
5,kaleen,90.0,n


In [47]:
#searching for multiple values
copyDf.loc[copyDf['Grade'].isin([80, 81, 82, 84])]

Unnamed: 0,Name,Grade,Hired
3,kyle,80.0,y
6,andrew,81.0,y


Note that most of the values being returned are still pandas objects (DataFrames or Series) rather than plain Python or NumPy values, so convert them when another library expects a different type. Additionally, when we get to building models, most of them accept only numerical input, so categorical data must be converted to a numerical representation. There are various ways to do this; refer to this [link](https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02) for more insights.
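
As a small illustration of turning categorical data into numbers, the sketch below shows two options on a made-up frame `toy`. The 1/0 mapping and the `HiredNum` column name are assumptions for this example only.

```python
import pandas as pd

toy = pd.DataFrame({"Name": ["a", "b", "c"],
                    "Hired": ["y", "n", "y"]})

# Binary mapping: a hypothetical 1/0 encoding of the y/n labels
toy["HiredNum"] = toy["Hired"].map({"y": 1, "n": 0})

# One-hot encoding: one indicator column per category
onehot = pd.get_dummies(toy["Hired"], prefix="Hired")

print(toy)
print(onehot)
```

A simple mapping works for two categories; one-hot encoding is the more usual choice when a column has several unordered categories.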

In [48]:
#counting how many people were hired; the result is the same on copyDf
#index value_counts() by label rather than by position
hiredStud = df['Hired'].value_counts()['y']
nonhiredStud = df['Hired'].value_counts()['n']
print("hired students: ",hiredStud, "\nnon-hired students: ", nonhiredStud)

hired students:  4 
non-hired students:  4


Selecting row with iloc


1.   similar to numpy slicing



In [49]:
#selecting rows with .iloc
row = df.iloc[-1]
print(row)
#selecting first 5 rows of data frame with all cols
print()
temp = df.iloc[:5, :]
print(temp)


Name     lise
Hired       n
Name: nan, dtype: object

          Name Hired
Grade               
79.0    kartik     y
99.0     fahed     n
89.0    zainub     y
80.0      kyle     y
87.0   spencer     n
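
The difference between `.loc` (labels) and `.iloc` (positions) is easiest to see when the index is not 0, 1, 2, and so on. A minimal sketch with a made-up frame and index:

```python
import pandas as pd

toy = pd.DataFrame({"Name": ["a", "b", "c"]}, index=[10, 20, 30])

print(toy.loc[10, "Name"])    # label 10 -> first row
print(toy.iloc[0, 0])         # position 0 -> the same row
print(len(toy.loc[10:20]))    # label slices include the end label: 2 rows
print(len(toy.iloc[0:1]))     # position slices exclude the end: 1 row
```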


Adding/Dropping Columns

In [50]:
#adding a column of equivalent letter grades to copyDf
# A = 90-100, B = 80-89, C = 70-79, NaN = failed
letterGrades = []
for val in copyDf["Grade"]:
    if np.isnan(val):
        letterGrades.append("Fail")
    elif val >= 90:
        letterGrades.append("A")
    elif val >= 80:
        letterGrades.append("B")
    elif val >= 70:
        letterGrades.append("C")
    else:
        # assumption: grades below 70 also count as failed, so every row gets a label
        letterGrades.append("Fail")
copyDf['LG'] = letterGrades
copyDf




Unnamed: 0,Name,Grade,Hired,LG
0,kartik,79.0,y,C
1,fahed,99.0,n,A
2,zainub,89.0,y,B
3,kyle,80.0,y,B
4,spencer,87.0,n,B
5,kaleen,90.0,n,A
6,andrew,81.0,y,B
7,lise,,n,Fail


In [51]:
#now lets drop the Grade column as the data is repetitive
copyDf.drop(['Grade'], axis = 1, inplace = True)
copyDf

Unnamed: 0,Name,Hired,LG
0,kartik,y,C
1,fahed,n,A
2,zainub,y,B
3,kyle,y,B
4,spencer,n,B
5,kaleen,n,A
6,andrew,y,B
7,lise,n,Fail
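
The letter-grade column can also be built without an explicit loop. The sketch below uses `np.select` on a made-up Series `grades`; treating grades below 70 as "Fail" is an assumption, since the lab only defines A, B, C, and NaN.

```python
import numpy as np
import pandas as pd

grades = pd.Series([79, 99, 89, np.nan, 65])

# Conditions are checked in order; the first match wins.
# Comparisons with NaN are False, so NaN falls through to the default.
conditions = [grades >= 90, grades >= 80, grades >= 70]
letters = np.select(conditions, ["A", "B", "C"], default="Fail")
print(letters)
```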


# Exercise with real data



* Import [**CAvideos.csv**](https://www.kaggle.com/datasnaek/youtube-new)
  * If you are on Google Colaboratory, import the file using the left sidebar (under the colab logo).  
> Files -> Upload to session storage

  * If you are on another notebook program, just make sure the file is in the same directory as this notebook.












Dataframe handling

First we import Pandas to help us load our CSV file into a DataFrame object.  
For more info, click here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html


In [52]:
# converts the CSV file to a dataframe object
dataframe = pd.read_csv('CAvideos.csv') 

dataframe.head() # shows the first 5 rows of the df

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,n1WpP7iowLc,17.14.11,Eminem - Walk On Water (Audio) ft. Beyoncé,EminemVEVO,10,2017-11-10T17:00:03.000Z,"Eminem|""Walk""|""On""|""Water""|""Aftermath/Shady/In...",17158579,787425,43420,125882,https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg,False,False,False,Eminem's new track Walk on Water ft. Beyoncé i...
1,0dBIkQ4Mz1M,17.14.11,PLUSH - Bad Unboxing Fan Mail,iDubbbzTV,23,2017-11-13T17:00:00.000Z,"plush|""bad unboxing""|""unboxing""|""fan mail""|""id...",1014651,127794,1688,13030,https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg,False,False,False,STill got a lot of packages. Probably will las...
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146035,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095828,132239,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...
4,2Vv-BfVoq4g,17.14.11,Ed Sheeran - Perfect (Official Music Video),Ed Sheeran,10,2017-11-09T11:04:14.000Z,"edsheeran|""ed sheeran""|""acoustic""|""live""|""cove...",33523622,1634130,21082,85067,https://i.ytimg.com/vi/2Vv-BfVoq4g/default.jpg,False,False,False,🎧: https://ad.gt/yt-perfect\n💰: https://atlant...


Here, we do some data cleaning.  We remove the first column "video_id", and store the new dataframe.

In [53]:
trimmed_dataframe = dataframe.drop(columns=['video_id'])
trimmed_dataframe.head()


# Be careful if you reuse the same object. Example:
# dataframe = dataframe.drop(columns=['video_id'])
# With the line above, the cell can only be run once: the drop is not idempotent,
# so a second run raises a KeyError because 'video_id' is already gone.
# You can do dataframe.drop(columns=['video_id'], inplace=True) to modify
# dataframe in place instead (this also fails on a second run)

Unnamed: 0,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,17.14.11,Eminem - Walk On Water (Audio) ft. Beyoncé,EminemVEVO,10,2017-11-10T17:00:03.000Z,"Eminem|""Walk""|""On""|""Water""|""Aftermath/Shady/In...",17158579,787425,43420,125882,https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg,False,False,False,Eminem's new track Walk on Water ft. Beyoncé i...
1,17.14.11,PLUSH - Bad Unboxing Fan Mail,iDubbbzTV,23,2017-11-13T17:00:00.000Z,"plush|""bad unboxing""|""unboxing""|""fan mail""|""id...",1014651,127794,1688,13030,https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg,False,False,False,STill got a lot of packages. Probably will las...
2,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146035,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095828,132239,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...
4,17.14.11,Ed Sheeran - Perfect (Official Music Video),Ed Sheeran,10,2017-11-09T11:04:14.000Z,"edsheeran|""ed sheeran""|""acoustic""|""live""|""cove...",33523622,1634130,21082,85067,https://i.ytimg.com/vi/2Vv-BfVoq4g/default.jpg,False,False,False,🎧: https://ad.gt/yt-perfect\n💰: https://atlant...
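
If a cell that drops a column needs to be safe to rerun, one option is the `errors="ignore"` argument of `drop`. A sketch on a made-up two-column frame:

```python
import pandas as pd

toy = pd.DataFrame({"video_id": ["x1", "x2"], "views": [10, 20]})

toy = toy.drop(columns=["video_id"])
# A second plain drop would raise KeyError because the column is gone;
# errors="ignore" turns the repeat into a no-op
toy = toy.drop(columns=["video_id"], errors="ignore")
print(toy.columns.tolist())
```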


The following prints out the column names (some of these are useful features)!  

In [54]:
trimmed_dataframe.columns.values.tolist()

['trending_date',
 'title',
 'channel_title',
 'category_id',
 'publish_time',
 'tags',
 'views',
 'likes',
 'dislikes',
 'comment_count',
 'thumbnail_link',
 'comments_disabled',
 'ratings_disabled',
 'video_error_or_removed',
 'description']

QUESTION:

Check for missing values as they will prevent us from using our values in our model/any other calculations we may perform

1.   Find any missing values if they exist (2.5 points)
2.   Drop the missing values rows if they exist without creating a new dataset (2.5 points)

HINT: https://www.journaldev.com/33492/pandas-dropna-drop-null-na-values-from-dataframe







In [55]:
trimmed_dataframe.isnull().sum()


trending_date                0
title                        0
channel_title                0
category_id                  0
publish_time                 0
tags                         0
views                        0
likes                        0
dislikes                     0
comment_count                0
thumbnail_link               0
comments_disabled            0
ratings_disabled             0
video_error_or_removed       0
description               1296
dtype: int64

In [56]:
trimmed_dataframe.drop(['description'], axis = 1, inplace = True)
trimmed_dataframe

Unnamed: 0,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed
0,17.14.11,Eminem - Walk On Water (Audio) ft. Beyoncé,EminemVEVO,10,2017-11-10T17:00:03.000Z,"Eminem|""Walk""|""On""|""Water""|""Aftermath/Shady/In...",17158579,787425,43420,125882,https://i.ytimg.com/vi/n1WpP7iowLc/default.jpg,False,False,False
1,17.14.11,PLUSH - Bad Unboxing Fan Mail,iDubbbzTV,23,2017-11-13T17:00:00.000Z,"plush|""bad unboxing""|""unboxing""|""fan mail""|""id...",1014651,127794,1688,13030,https://i.ytimg.com/vi/0dBIkQ4Mz1M/default.jpg,False,False,False
2,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146035,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False
3,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095828,132239,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False
4,17.14.11,Ed Sheeran - Perfect (Official Music Video),Ed Sheeran,10,2017-11-09T11:04:14.000Z,"edsheeran|""ed sheeran""|""acoustic""|""live""|""cove...",33523622,1634130,21082,85067,https://i.ytimg.com/vi/2Vv-BfVoq4g/default.jpg,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
40876,18.14.06,HOW2: How to Solve a Mystery,Annoying Orange,24,2018-06-13T18:00:07.000Z,"annoying orange|""funny""|""fruit""|""talking""|""ani...",80685,1701,99,1312,https://i.ytimg.com/vi/sGolxsMSGfQ/default.jpg,False,False,False
40877,18.14.06,Eli Lik Lik Episode 13 Partie 01,Elhiwar Ettounsi,24,2018-06-13T19:01:18.000Z,"hkayet tounsia|""elhiwar ettounsi""|""denya okhra...",103339,460,66,51,https://i.ytimg.com/vi/8HNuRNi8t70/default.jpg,False,False,False
40878,18.14.06,KINGDOM HEARTS III – SQUARE ENIX E3 SHOWCASE 2...,Kingdom Hearts,20,2018-06-11T17:30:53.000Z,"Kingdom Hearts|""KH3""|""Kingdom Hearts 3""|""Froze...",773347,25900,224,3881,https://i.ytimg.com/vi/GWlKEM3m2EE/default.jpg,False,False,False
40879,18.14.06,Trump Advisor Grovels To Trudeau,The Young Turks,25,2018-06-13T04:00:05.000Z,"180612__TB02SorryExcuse|""News""|""Politics""|""The...",115225,2115,182,1672,https://i.ytimg.com/vi/lbMKLzQ4cNQ/default.jpg,False,False,False


We can use these to pull out certain columns.  
For example, here are the names of the titles.

In [57]:
trimmed_dataframe["title"].tolist()
# this line does exactly the same thing
# list(trimmed_dataframe['title'])

['Eminem - Walk On Water (Audio) ft. Beyoncé',
 'PLUSH - Bad Unboxing Fan Mail',
 'Racist Superman | Rudy Mancuso, King Bach & Lele Pons',
 'I Dare You: GOING BALD!?',
 'Ed Sheeran - Perfect (Official Music Video)',
 'Jake Paul Says Alissa Violet CHEATED with LOGAN PAUL! #DramaAlert Team 10 vs  Martinez Twins!',
 'Vanoss Superhero School - New Students',
 'WE WANT TO TALK ABOUT OUR MARRIAGE',
 'THE LOGANG MADE HISTORY. LOL. AGAIN.',
 'Finally Sheldon is winning an argument about the existence of God',
 '21 Savage - Bank Account (Official Music Video)',
 '12 Weird Ways To Sneak Food Into Class / Back To School Pranks',
 '猎场 | Game Of Hunting 12【TV版】（胡歌、張嘉譯、祖峰等主演）',
 'Daang ( Full Video ) | Mankirt Aulakh | Sukh Sanghera | Latest Punjabi Song 2017 | Speed Records',
 'YOUTUBERS REACT TO TOP 10 TWITTER ACCOUNTS OF ALL TIME',
 'I Hired An MI6 Spy To Help Me Disappear',
 'Fake Pet Smart Employee Prank!',
 'Jason Momoa Wows Hugh Grant With Some Dothraki | The Graham Norton Show',
 'Rooster Te

# **Very Useful Tip**: 
Use `type(obj)` to learn about how the dataframe works to help prevent type mismatches.   

Notice that *dataframe.values* is a numpy array. 




In [58]:
type(trimmed_dataframe)

pandas.core.frame.DataFrame

In [59]:
type(trimmed_dataframe.values)

numpy.ndarray

In [60]:
type(trimmed_dataframe.columns)

pandas.core.indexes.base.Index

In [61]:
type(trimmed_dataframe.columns.values.tolist())

list

We can extract the numerical and categorical values from a DataFrame into a Numpy array. 


In [62]:
# slices everything into a numpy data array
data_array = trimmed_dataframe.values[:, :]

QUESTION: Show the key three attributes of the numpy array above (3 points)


In [63]:
data_array.ndim

2

In [64]:
data_array.shape


(40881, 14)

In [65]:
data_array.size


572334

QUESTION: Show the first 5 rows and the last column (3 points)

In [66]:
trimmed_dataframe[["video_error_or_removed"]].head()

Unnamed: 0,video_error_or_removed
0,False
1,False
2,False
3,False
4,False


QUESTION: Sort the df by the most disliked videos. (3 points)

Use .head() for your answer



In [67]:
trimmed_dataframe.sort_values(by='dislikes',ascending=False).head()


Unnamed: 0,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed
5900,17.13.12,YouTube Rewind: The Shape of 2017 | #YouTubeRe...,YouTube Spotlight,24,2017-12-06T17:58:51.000Z,"Rewind|""Rewind 2017""|""youtube rewind 2017""|""#Y...",137843120,3014479,1602383,817582,https://i.ytimg.com/vi/FlsCjmMhFmw/default.jpg,False,False,False
5623,17.12.12,YouTube Rewind: The Shape of 2017 | #YouTubeRe...,YouTube Spotlight,24,2017-12-06T17:58:51.000Z,"Rewind|""Rewind 2017""|""youtube rewind 2017""|""#Y...",125431369,2912715,1545018,807558,https://i.ytimg.com/vi/FlsCjmMhFmw/default.jpg,False,False,False
5398,17.11.12,YouTube Rewind: The Shape of 2017 | #YouTubeRe...,YouTube Spotlight,24,2017-12-06T17:58:51.000Z,"Rewind|""Rewind 2017""|""youtube rewind 2017""|""#Y...",113876217,2811217,1470387,787174,https://i.ytimg.com/vi/FlsCjmMhFmw/default.jpg,False,False,False
5197,17.10.12,YouTube Rewind: The Shape of 2017 | #YouTubeRe...,YouTube Spotlight,24,2017-12-06T17:58:51.000Z,"Rewind|""Rewind 2017""|""youtube rewind 2017""|""#Y...",100911567,2656678,1353655,682890,https://i.ytimg.com/vi/FlsCjmMhFmw/default.jpg,False,False,False
4996,17.09.12,YouTube Rewind: The Shape of 2017 | #YouTubeRe...,YouTube Spotlight,24,2017-12-06T17:58:51.000Z,"Rewind|""Rewind 2017""|""youtube rewind 2017""|""#Y...",75969469,2251826,1127811,827755,https://i.ytimg.com/vi/FlsCjmMhFmw/default.jpg,False,False,False


QUESTION: How many videos had > 50,000,000 views (3 points)

In [105]:
np.count_nonzero(data_array[:,6]>50000000)

27

QUESTION: Calculate the average views per video.  

1.   Using np.average (2.5 points)
2.   using df['column'].mean() (2.5 points)







In [131]:
views = np.array(data_array[:,6], dtype=np.int64)
np.average(views)

1147035.9107898534

In [129]:
trimmed_dataframe['views'].mean()

1147035.9107898534

QUESTION: How many videos are by "Ed Sheeran?" (Column name is "channel_title") (3 points)

In [106]:
np.count_nonzero(data_array[:,2]=='Ed Sheeran')

24

QUESTION: Find the number of video titles that include Beyonce (3 points)


In [162]:
trimmed_dataframe[trimmed_dataframe['title'].str.contains('Beyonce')].shape[1]+np.count_nonzero(data_array[:,2]=='Beyonce')

19
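
When counting matching rows, note that `.shape[1]` is the number of columns, while `.shape[0]` (or summing the boolean mask) gives the number of rows. The sketch below shows the difference on made-up titles; searching for the prefix "beyonc" with `case=False` is an assumption made here so that both "Beyonce" and "Beyoncé" match.

```python
import pandas as pd

toy = pd.DataFrame({"title": ["Eminem ft. Beyoncé", "beyonce live",
                              "Other", "BEYONCE - Halo"]})

mask = toy["title"].str.contains("beyonc", case=False)
print(mask.sum())              # number of matching rows
print(toy[mask].shape[0])      # same count via the row dimension
print(toy[mask].shape[1])      # number of columns, unrelated to the matches
```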

QUESTION: Find the Number of Videos that disabled comments (3 points)

In [103]:
np.count_nonzero(data_array[:,11])

583

EXTRA CREDIT: Find the Video with the best Like to Dislike Ratio (5 points)

In [194]:
copy = trimmed_dataframe.copy()
copy = copy[copy['dislikes'] != 0]
copy['likes'] = copy['likes']/copy['dislikes']
copy = copy.sort_values(by='likes',ascending=False)
print("\""+copy.values[0,1]+"\" has likes/dislikes ratio of "+str(copy.values[0,7])+" \n(if video has 0 dislikes then the ratio is 1:0 which is trivial, thus ignored for this calculation)")

"The Reaction of The Streets (I Wait-Day6 Edition)" has likes/dislikes ratio of 2844.3333333333335 
(if video has 0 dislikes then the ratio is 1:0 which is trivial, thus ignored for this calculation)
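
An alternative sketch of the same idea keeps the `likes` column intact by storing the ratio in its own column and looks the row up with `idxmax`. The frame `toy` and the column name `ratio` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({"title": ["a", "b", "c"],
                    "likes": [100, 50, 10],
                    "dislikes": [10, 1, 0]})

# Skip rows with zero dislikes to avoid dividing by zero
rated = toy[toy["dislikes"] != 0].copy()
rated["ratio"] = rated["likes"] / rated["dislikes"]

# idxmax returns the index label of the largest ratio
best = rated.loc[rated["ratio"].idxmax()]
print(best["title"], best["ratio"])
```

Selecting by column name rather than by position in `.values` also keeps the code working if columns are added or reordered.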
