- Course Includes:
- NumPy
- SciPy
- Pandas
    - Seaborn
    - Scikit-Learn
    - Matplotlib
    - Plotly
    - PySpark
    - TensorFlow intro
- we use Jupyter Notebooks (included in Anaconda)
- we use Python 3 (comes with Anaconda)
- to install Jupyter with a plain Python 3 install: `pip3 install jupyter`. Jupyter supports more than 40 programming languages
- we can use an online version
- we install Anaconda
- we start notebooks from the terminal with `jupyter notebook`, run in the folder where we want it to work
- we navigate to the course repo
- For a tutorial on using Jupyter see the PythonBootcamp notes
- we can download jupyter files in other formats
- virtual environments allow setting up virtual installations of python and libraries on our computer
- we can have multiple versions of python or libs and activate or deactivate these environments
- handy if we want to program in different versions of the library
- e.g. we develop with SciKit v0.17 and v0.18 comes out. we want to experiment with it but don't want to break old code
- sometimes we want our library installations in the correct location (py2.7 vs py3.5)
- there is the virtualenv lib for normal python distributions
- anaconda has an inbuilt virtual environment manager (see link)
- we create a virtual environment with conda named fluffy, specifying the libs we want in it: `conda create --name fluffy numpy`
- we activate this environment with `activate fluffy` and deactivate it with `deactivate fluffy`
- if we write `python` we get the normal version installed with conda or directly
- in the python CLI we can import the libs manually: `import numpy as np`, `import pandas as pd`
- we quit the CLI with `quit()`
- we activate our virtualenv with `activate fluffy`. if we then invoke `python` we get the lib version we specified. this env has no pandas, so we cannot import it; we have to install it (while the env is activated) in that environment
- we can set an env for a given python version (for testing): `conda create --name mypython3version python=3.5 numpy`
- we can list our envs with `conda info --envs`
Going to skim through this as Python Bootcamp is a complete Python Course
- NumPy is a Linear Algebra Lib for Python
- It is very important for Data Science with Python
- Almost all of the libraries in the PyData Ecosystem rely on NumPy as one of their main building blocks
- NumPy is blazing fast as it binds to C libraries
- It is recommended to install Python with Anaconda to make sure all dependencies are in sync with the conda install
- If we have Anaconda, to install NumPy we use `conda install numpy` or `pip install numpy`
- NumPy arrays are the main way we use NumPy in the course
- Numpy arrays come in two flavors: vectors and matrices
- Vectors are 1-D arrays
- Matrices are 2-D (they can also have just 1 row or 1 column)
- we have a list: `my_list = [1,2,3]`
- we import numpy: `import numpy as np`
- we use the numpy array wrapper method to cast the list as an array: `np.array(my_list)` gives back `array([1,2,3])`, a 1-D array
- if we have a nested python list (list of lists) and use the numpy array wrapper it gets cast to a matrix or 2-D array
np.array([[1,2,3],[4,5,6],[7,8,9]])
>>> array([[1,2,3],
           [4,5,6],
           [7,8,9]])
- we can use np.arange(start,stop) to generate arrays from a range: `np.arange(0,10)` => `array([0,1,2,3,4,5,6,7,8,9])`
- we can pass a 3rd argument, the step size, `np.arange(start,stop,stepsize)`, to jump numbers in the range
- if we want to generate fixed-value arrays we can use `np.zeros(3)`, passing a single num to get a 1-D array `array([0.,0.,0.])`, or passing a tuple to get a multidimensional matrix `np.zeros((5,5))` where the tuple elements are the sizes (1st number rows, 2nd number columns). the method name is the value written out, e.g. zeros, ones
- the method linspace makes an array (1-D vector) distributing the range evenly based on the specified num of points: `np.linspace(start,end,numofpoints)`, e.g. `np.linspace(0,5,10)` => `array([0., 0.55555556, 1.11111111, 1.66666667, 2.22222222, 2.77777778, 3.33333333, 3.88888889, 4.44444444, 5.])`. so it produces an array of evenly spaced points over the range
- the number of dimensions in an array is indicated by the number of brackets
- with numpy we can make an identity matrix I with `np.eye(size)`, passing the size (num of rows or columns). it is 2-D and square, the diagonal is ones and the rest is zeros
- with numpy we have a large selection of random methods under `np.random.` (hit TAB to see them)
- we can create arrays of random numbers (uniform distribution) passing the size (one argument per dimension): `np.random.rand(size)` or `np.random.rand(sizeX,sizeY)`
- if we want an array of samples from the standard normal or Gaussian distribution we use `np.random.randn(size)`, again one number per dimension
- `np.random.randint(low,high,size)` returns random integers from a low to a high number
- Attributes of Arrays:
- we can reshape arrays passing the new dimensions and sizes: `arr = np.arange(25)` is a 1-D vector of size 25 that becomes a 5x5 matrix with `arr.reshape(5,5)`. if we cannot fill the new array completely we get an error; the total size must remain unchanged
- we can find the min and max value in an array with `my_array.max()` or `.min()`
- with `.argmax()` and `.argmin()` we get the location (index) of the max or min
- `.shape` returns the shape of the array, e.g. (25,)
- we can use `.dtype` to find the datatype
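- a minimal sketch pulling the creation and attribute calls above together (the values are just illustrative):

```python
import numpy as np

arr = np.arange(25)          # 1-D vector with values 0..24
mat = arr.reshape(5, 5)      # total size must stay 25

print(mat.shape)             # (5, 5)
print(arr.max(), arr.min())  # 24 0
print(arr.argmax())          # 24 -> index of the max value
print(arr.dtype)             # e.g. int64 (platform dependent)
```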
- we can reduce the size of method calls by importing the methods from the libs into the namespace
from numpy.random import randint
- we set a 0-10 range array with `arr = np.arange(0,11)`
- picking an element in an array is similar to picking from a list: `arr[9]`
- we can use slicing like in lists: `arr[1:5]`, `arr[:5]`, or `arr[::2]` to jump over elements
- an array can broadcast or set a value to a slice: `arr[0:2] = 100` => `array([100,100,2,3,4,5,6,7,8,9,10])`
- a slice of an array points to the original array. if we take a slice, assign it to a variable, and broadcast a value to it, the broadcast value takes effect in the original array as well
- if we want a new copy we have to use `arr.copy()` and make the changes there
- we create a 2-D array: `arr_2d = np.array([[5,10,15],[20,25,30],[35,40,45]])`
- we can select an element of a 2-D matrix with `arr_2d[0][0]` => 5 (row then column). `arr_2d[0]` returns the whole first row
- we can use comma single-bracket notation: `arr_2d[2,1]` => 40
- we can get slices or submatrices with `arr_2d[:2,1:]` => `array([[10,15],[25,30]])`. again if we omit a dimension we get the row slice
- we can use conditional selection: with `arr = np.arange(1,11)` we can use a condition to turn it into a boolean array, e.g. `bool_arr = arr > 4` is `array([False,False,False,False,True,True,...], dtype=bool)`
- we can use the boolean array for conditional selection: `arr[bool_arr]` => `array([5,6,7,8,9,10])`. so we pass the bool array to the array and filter it, selecting what is True, or directly `arr[arr>4]`
- we set up an array: `arr = np.arange(0,11)`
- we can add two arrays element by element with `arr + arr`, and likewise subtract or multiply them
- we can use a scalar, broadcasting its effect to all array elements: `arr + 100` => `array([100,101,102,103,104,105,106,...])`. the same works for - / * **
- numpy throws warnings instead of errors, e.g. on zero division
- numpy has the special values inf for infinity and nan for not-a-number
- we can use universal functions on arrays: `np.sqrt(arr)` takes the square root of every element, `np.exp(arr)` exponentiates every element, and `np.max(arr)` finds the maximum value like `arr.max()`. see the docs for the full list of universal functions
- to sum all columns in a matrix: `mat.sum(axis=0)`
- see the docstring of a function in jupyter with Shift+Tab
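- a short sketch of the slicing, boolean selection and universal-function calls noted above (numbers are just examples):

```python
import numpy as np

arr = np.arange(0, 11)
print(arr[1:5])              # slicing like a list
arr_slice = arr[0:2]
arr_slice[:] = 100           # broadcasting into the slice...
print(arr)                   # ...changes the original array too

safe = np.arange(0, 11).copy()   # an independent copy

print(safe[safe > 4])        # boolean / conditional selection
print(safe + safe)           # element-wise addition
print(np.sqrt(safe))         # universal function applied to every element

mat = np.arange(1, 10).reshape(3, 3)
print(mat.sum(axis=0))       # column sums
```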
- Pandas is an open source library built on top of NumPy
- It allows for fast analysis and data cleaning and preparation
- It excels in performance and productivity
- It also has built-in visualization features
- It can work with data from a variety of sources
- we can install pandas by going to our command line or terminal and using either `conda install pandas` or `pip install pandas`
- we will see:
- series
- dataframes
- missing data
- groupBy
- merging,joining, concat
- operations
- data IO
- a pandas Series is a datatype built on the numpy array
- the difference between the two is that a pandas Series can be accessed by label (indexed by label)
- to use pandas we need to import numpy first
import numpy as np
import pandas as pd
- we create various series from various datatypes.
- first we create standard python lists
labels = ['a','b','c']
my_data = [10,20,30]
- then we cast the values to a numpy array: `arr = np.array(my_data)`
- and we set the relationship between the two as a standard python dictionary: `d = {'a':10,'b':20,'c':30}`
- we end up with 4 objects: 2 lists, 1 array, 1 dictionary
- we create the pandas Series with `pd.Series(data=my_data)`, setting only the actual data, not the labels. this auto-indexes with numbers 0,1,2,...
- we can specify the index with `pd.Series(data=my_data, index=labels)`. now the indexes have labels 'a','b','c'
- so unlike numpy arrays we have labeled indexes, and we can call the data points using the labels
- we can use positional arguments, `pd.Series(my_data,labels)`, taking care to keep the order
- we can create a series passing a numpy array; it works like passing a data list: `pd.Series(arr,labels)`
- we can create a series passing a python dictionary. this auto-creates the labels using the key-value relationship: keys become index labels and values the data. `pd.Series(d)` <=> `pd.Series(my_data,labels)`
- a series can have any type of data as datapoints: `pd.Series(data=labels)` => dtype: object
- it can even hold functions as datapoints: `pd.Series(data=[sum,print,len])`
- grabbing data from a series is similar to grabbing data from a dictionary: `ser1 = pd.Series([1,2,3,4],['USA','Germany','USSR','Japan'])`, then `ser1['USA']` => 1. we should know the label type though; if there are no labels we grab elements like in python lists or numpy arrays
- adding series is done based on index: `series1 + series2`. if an index is not common, the result series keeps the non-common labels but their value is NaN
- when we perform operations on integer datapoints they are converted to floats
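- a small sketch of the Series behaviour described above (creation from list/dict, label access, NaN for non-matching labels):

```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c']
my_data = [10, 20, 30]
d = {'a': 10, 'b': 20, 'c': 30}

s1 = pd.Series(data=my_data, index=labels)  # labeled index
s2 = pd.Series(d)                           # dict keys become the labels

print(s1['a'])        # access by label -> 10

ser1 = pd.Series([1, 2, 3, 4], ['USA', 'Germany', 'USSR', 'Japan'])
ser2 = pd.Series([1, 2, 5, 4], ['USA', 'Germany', 'Italy', 'Japan'])
print(ser1 + ser2)    # aligned by index; non-common labels -> NaN, ints become floats
```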
- DataFrames expose the true power of pandas
- we import randn, `from numpy.random import randn`, and seed the generator: `np.random.seed(101)`
- DataFrame builds on Series and has a similar function signature but adds a columns argument: `df = pd.DataFrame(randn(5,4),['A','B','C','D','E'],['W','X','Y','Z'])`. the columns arg is the label index for the columns. so DataFrame is for numpy matrices what Series is for numpy 1-D vectors. `df` prints out the table with axis labels
- a DataFrame in essence is a set of Series sharing an index. we can use indexing and selection to extract Series out of a DataFrame, e.g. `df['W']`; an alternate way is `df.W`
- if we pass a list of keys, `df[['W','Z']]`, we get a sub-DataFrame
- we can set a new column, not as `df['new']` alone but as `df['new'] = df['W'] + df['Y']`; we need to assign something to it
- we can delete a column with `df.drop('W', axis=1)`. if we don't specify an axis we can delete a row passing the correct label. drop does not affect the original dataframe; to alter the original we must pass inplace=True: `df.drop('W', axis=1, inplace=True)`
- to drop a row we can omit axis or set it to 0: `df.drop('E', axis=0)`
- rows have axis=0 and columns have axis=1. this comes from numpy, as shown by `df.shape`, which gives a tuple (5,4): index 0 is rows and 1 is columns
- to select rows we use `df.loc['A']`. this returns a Series, so rows are Series just like columns. we can also select rows by position instead of label with iloc: `df.iloc[1]`
- to select subsets of rows and columns we use loc and numpy-style comma notation: `df.loc['B','Y']` or `df.loc[['A','B'],['X','Y']]`
- next we see conditional selection and multi-index selection
- we can use comparison operators broadcast over all datapoints: `booldf = df > 0` gives a df where all datapoints are booleans
- if we pass booldf into df, `df[booldf]`, we get the df where all datapoints where booldf is False are NaN (null). another way is `df[df>0]`
- `df['W']>0` gives a boolean Series for that column. if we pass it as a conditional selector, `df[df['W']>0]`, the filter is applied to all columns, dropping the rows where W is not > 0
- we want all the rows of the dataframe where Z is < 0: `df[df['Z']<0]`. we know Z is < 0 only in the 3rd row, so that is what we get back
- we can do selection on the fly on the filtered dataframe: `df[df['W']>0]['X']` or `df[df['W']>0][['Y','X']]`
- one-liners are generally faster
- we can use multiple conditions. `(df['W']>0) and (df['Y']>1)` throws an error, as normal python boolean operators cannot handle series of data, only single bools. we have to use & instead: `(df['W']>0) & (df['Y']>1)` is valid and can be passed as a conditional selector: `df[(df['W']>0) & (df['Y']>1)]`. for or we use |
- we can reset the index or set it to something else. we use `df.reset_index()` to reset the index (now they are not labels but index numbers) and move the old labels to a separate column named index. this operation does not alter the original dataframe
- we can set the index to a new list of labels: `newind = 'CA NY WY OR CO'.split()`. we then set it as a column, `df['States'] = newind`, and then set this column as index with `df.set_index('States')`. the effect of that is to create a new row for the index column title
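- a sketch of the selection calls above on the same 5x4 random dataframe:

```python
import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(101)
df = pd.DataFrame(randn(5, 4), ['A', 'B', 'C', 'D', 'E'], ['W', 'X', 'Y', 'Z'])

print(df['W'])                       # a single column is a Series
df['new'] = df['W'] + df['Y']        # new column from existing ones
df = df.drop('new', axis=1)          # drop returns a copy unless inplace=True

print(df.loc['A'])                   # row by label
print(df.iloc[1])                    # row by position
print(df.loc[['A', 'B'], ['X', 'Y']])      # sub-dataframe
print(df[(df['W'] > 0) & (df['Y'] > 1)])   # multiple conditions need & / |
```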
- we look into multi-index dataframes and multilevel indexes
- we build a MultiIndex as follows:
- we create two lists: `outside = ['G1','G1','G1','G2','G2','G2']`, `inside = [1,2,3,1,2,3]`
- we zip them into a list of tuples: `hier_index = list(zip(outside,inside))`
- we create a multiindex using the pandas MultiIndex method, specifying that we build it from tuples: `hier_index = pd.MultiIndex.from_tuples(hier_index)`. we get a MultiIndex object with levels and labels
- we create a dataframe with 6-by-2 random data, the hier_index and labels for the 2 columns: `df = pd.DataFrame(randn(6,2),hier_index,['A','B'])`
- the result is a 6-by-2 dataframe with 2 index columns, the first with labels G1 and G2 and the second with labels 1,2,3, all grouped in the order we declared them in the lists. the index columns are called levels
- we can call data from multi-indexed dataframes: `df.loc['G1']`, and we can chain multiple levels: `df.loc['G1'].loc[1]`
- we can get and set the names of the index level columns: `df.index.names` retrieves a FrozenList with None; to set we use `df.index.names = ['Groups','Num']`. the index labels appear in a new extra row
- we can select a specific element with `df.loc['G2'].loc[2]['B']`
- we can get a cross-section of rows/columns with xs, which we can use when we have a multilevel index: `df.xs('G1')` is like `df.loc['G1']`, but xs has the ability to go inside the multilevel index
- say we want all values with Num index = 1: `df.xs(1,level='Num')`
- we define a dictionary to populate a DataFrame: `d = {'A':[1,2,np.nan],'B':[5,np.nan,np.nan],'C':[1,2,3]}`, filling in lists with null values (`np.nan`) as values
- we use it to create a dataframe: `df = pd.DataFrame(d)`
- we drop the missing values with `df.dropna()`. this drops any row that has missing value(s). if we pass axis=1 we drop the columns with missing values
- we can pass a thresh argument to dropna to set the minimum number of non-null values required to keep a row or column: `df.dropna(thresh=2)`
- we can fill in missing values with fillna: `df.fillna(value='FILL VALUE')` will pass the FILL VALUE string to any null datapoint
- we can fill the mean of a column into a missing datapoint: `df['A'].fillna(value=df['A'].mean())`
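- the missing-data calls above, pulled together into a runnable sketch:

```python
import numpy as np
import pandas as pd

d = {'A': [1, 2, np.nan], 'B': [5, np.nan, np.nan], 'C': [1, 2, 3]}
df = pd.DataFrame(d)

print(df.dropna())            # drop rows with any NaN
print(df.dropna(axis=1))      # drop columns with any NaN
print(df.dropna(thresh=2))    # keep rows with at least 2 non-null values
print(df.fillna(value='FILL VALUE'))
print(df['A'].fillna(value=df['A'].mean()))  # fill with the column mean
```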
- groupby is used to produce aggregates; it allows us to group rows together based on a column and perform an aggregate function on them, e.g. we can group by id
- we use a dictionary to produce a dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
- we have a column based on which we can group sales: Company
- `byComp = df.groupby('Company')`. this produces a GroupBy object, not a df
- the way to use GroupBy objects is to apply aggregate functions on them: `byComp.mean()`
- we can use many aggregate functions like `.sum()`, `.std()`
- the result of the aggregate function is a dataframe; we can use df methods on it: `byComp.sum().loc['FB']`
- usually we produce one-liners by chaining methods
- usually aggregate methods operate on and return only the numeric columns of the new dataframe. some can operate on strings as well and return result columns for them, e.g. .count() .max() .min()
- if instead of an aggregate method we chain a `.describe()` method to a GroupBy object, we get many aggregate results in the resulting df. we can chain `.transpose()` after the aggregate or describe to view them as a row rather than a column
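- a sketch of the groupby flow above; the numeric Sales column is selected explicitly to keep it working across pandas versions:

```python
import pandas as pd

data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
        'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
        'Sales': [200, 120, 340, 124, 243, 350]}
df = pd.DataFrame(data)

byComp = df.groupby('Company')              # GroupBy object, not yet a dataframe
print(byComp['Sales'].mean())               # aggregate -> per-company mean
print(byComp['Sales'].sum().loc['FB'])      # chain normal df/Series methods
print(df.groupby('Company')['Sales'].describe())  # many aggregates at once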
- we create 3 dataframes (df1,df2,df3) with the same labeled columns ('A','B','C','D'), continuous indexes, and string datapoints that represent the combination of column and index, e.g. 'B5'
- we concatenate them (glue them together, stacking up rows) with `pd.concat([df1,df2,df3])`. the dimensions should match on the axis we are concatenating on. here we don't specify an axis, so we concatenate along the rows (axis=0), which has size 4 for all
- we can concatenate along the columns with `pd.concat([df1,df2,df3], axis=1)`. the rule about size applies; missing values become NaN. the result is a dataframe
- we create two new dataframes with the same indexes, columns labeled with consecutive letters, and a key column with the same datapoints in each
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
- we can use merge to merge dataframes the same way we merge SQL tables
- we do it with `pd.merge(left,right,how='inner',on='key')`. the default how is 'inner', like an inner join; on is the column where the merge takes place
- we can do an inner merge on multiple keys, where the merge happens where the combination of keys is the same for both dataframes
- an outer merge keeps all combinations of keys
- there is also right and left merge
- joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame
- if our dataframes are df1 and df2 we join them with `df1.join(df2)` (join works on the index; the default how is 'left'). the outer join is `df1.join(df2,how='outer')`: an inner join keeps only the range of keys that are common, an outer join keeps all
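- a sketch of concat, merge and join on the two dataframes above:

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

print(pd.concat([left, right]))            # stack rows; non-matching columns -> NaN
print(pd.concat([left, right], axis=1))    # glue columns side by side
print(pd.merge(left, right, how='inner', on='key'))  # SQL-style merge on the key column

# join works on the index instead of a column
l2 = left.set_index('key')
r2 = right.set_index('key')
print(l2.join(r2, how='outer'))
```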
- we create a dataframe: `df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})` with 3 columns and an unlabeled index
- to find the unique values in a dataframe column we use `df['col2'].unique()`, which returns a numpy array
- to count the unique values: `len(df['col2'].unique())` or `df['col2'].nunique()`
- to get counts for the occurrences of each value in a column we use `df['col2'].value_counts()`, which returns a Series with the values as index and their counts as data
- to select data from a df we can use conditional selection or the apply method
- we define a method:
def times2(x):
    return x*2
- to broadcast a function over a dataframe column we use apply, e.g. `df['col1'].apply(times2)` applies it to the col1 values, returning a Series
- we can apply built-in methods or even lambdas: `df['col2'].apply(lambda x: x*2)`
- we can get the column labels with `df.columns`, which gives an Index object with the column labels, or for the index `df.index`
- we can sort values in a column with `df.sort_values('col2')`; the index follows the sort
- to find the nulls in the dataframe we can use `df.isnull()`
- we create a dataframe with repeating values
- we can create pivot tables out of a dataframe with `df.pivot_table(values='D',index=['A','B'],columns=['C'])`. this will be a multi-index dataframe
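- the operations above as a runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [444, 555, 666, 444],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})

print(df['col2'].unique())        # array of distinct values
print(df['col2'].nunique())       # 3
print(df['col2'].value_counts())  # counts per value

def times2(x):
    return x * 2

print(df['col1'].apply(times2))           # apply a named function
print(df['col2'].apply(lambda x: x * 2))  # or a lambda
print(df.sort_values('col2'))             # index follows the sort
print(df.isnull())                        # boolean mask of nulls
```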
- to input data from external sources we need to install extra libraries.
conda install sqlalchemy
conda install lxml
conda install html5lib
conda install beautifulsoup4
- we will do data IO from CSV, Excel, HTML and SQL
- `import pandas as pd`. we find our current dir in Jupyter with `pwd`
- we have a csv file named example with the following content
a,b,c,d
0,1,2,3
4,5,6,7
8,9,10,11
12,13,14,15
- we read a csv named example in the current dir: `df = pd.read_csv('example')`, which gets imported and stored as a dataframe
- pandas can read from a wide variety of file formats with `.read_<filetype>(<filename>)`
- we can store to a file, e.g. csv, with `df.to_csv('my_output', index=False)`. we set index to False as we don't want to save the index as a separate column
- pandas can read from excel but reads only the data, not formulas or macros. reading excel files with formulas will make it crash
- to read from excel we install xlrd: `conda install xlrd` (installed with anaconda) or with pip
- we have an excel file Excel_Sample.xlsx with the same data as the csv in Sheet1
- pandas treats an excel workbook as a set of dataframes (one dataframe per sheet)
- the expression to read is `df = pd.read_excel('Excel_Sample.xlsx',sheetname='Sheet1')`, which stores the input in a dataframe
- we can store a dataframe to an excel sheet: `df.to_excel('Excel_Sample2.xlsx',sheet_name='NewSheet')`
- to read from HTML we need to install some libraries and restart jupyter notebook
conda install lxml
conda install html5lib
conda install beautifulsoup4
- we will read from the page http://www.fdic.gov/bank/individual/failed/banklist.html which contains a table
- we read with `df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')`, which stores the data in a list where df[0] is a DataFrame
- pandas is not the best way to work with SQL, as there are dedicated libs for each SQL DB flavor
- we create a sql engine: `from sqlalchemy import create_engine`
- then we create an in-memory sqlite db with the engine to test our code: `engine = create_engine('sqlite:///:memory:')`
- we store a dataframe as a sql table in our test db: `df.to_sql('data', engine)`; data is the name of the table
- we read the table back and store it as a dataframe: `sql_df = pd.read_sql('data',con=engine)`
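- a sketch of the IO round trips above; it assumes the small example csv/excel files from the lecture exist in the working dir:

```python
import pandas as pd
from sqlalchemy import create_engine

# CSV round trip (assumes an 'example' csv like the one shown above)
df = pd.read_csv('example')
df.to_csv('my_output', index=False)

# Excel round trip (needs an excel engine such as xlrd installed;
# note: newer pandas versions use sheet_name= instead of sheetname=)
xl = pd.read_excel('Excel_Sample.xlsx', sheetname='Sheet1')
xl.to_excel('Excel_Sample2.xlsx', sheet_name='NewSheet')

# SQL round trip against an in-memory sqlite db
engine = create_engine('sqlite:///:memory:')
df.to_sql('data', engine)
sql_df = pd.read_sql('data', con=engine)
```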
- we get datasets from kaggle
- Check the head of the DataFrame: `sal.head()`
- Use the .info() method to find out how many entries there are: `sal.info()`
- What is the average BasePay? `sal['BasePay'].mean()`
- What is the highest amount of OvertimePay in the dataset? `sal['OvertimePay'].max()`
- What is the job title of JOSEPH DRISCOLL? `sal[sal['EmployeeName']=='JOSEPH DRISCOLL']['JobTitle']`
- How much does JOSEPH DRISCOLL make (including benefits)? `sal[sal['EmployeeName']=='JOSEPH DRISCOLL']['TotalPayBenefits']`
- What is the name of the highest paid person (including benefits)? `sal[sal['TotalPayBenefits'] == max(sal['TotalPayBenefits'])]['EmployeeName']`
- alternative methods to get the max: `sal.loc[sal['TotalPayBenefits'].idxmax()]` or `sal.iloc[sal['TotalPayBenefits'].argmax()]`
- What is the name of the lowest paid person (including benefits)? Do you notice something strange about how much he or she is paid? `sal[sal['TotalPayBenefits'] == min(sal['TotalPayBenefits'])]`
- What was the average (mean) BasePay of all employees per year (2011-2014)? `sal.groupby('Year')['BasePay'].mean()`
- How many unique job titles are there? `len(sal['JobTitle'].unique())` or `sal['JobTitle'].nunique()`
- What are the top 5 most common jobs? `sal.groupby('JobTitle')['JobTitle'].count().sort_values(ascending=False)[:5]`, alternatively `sal['JobTitle'].value_counts().head(5)`
- How many Job Titles were represented by only one person in 2013?
usal = sal[(sal['Year']==2013)].groupby('JobTitle')['JobTitle'].count()
len(usal[usal == 1])
alternatively sum(sal[sal['Year']==2013]['JobTitle'].value_counts() == 1)
- How many people have the word Chief in their job title? (This is pretty tricky)
- my solution:
ysal = sal.drop_duplicates(["EmployeeName", "Year"])
ysal[ysal['JobTitle'].str.lower().str.contains('chief')].shape[0]
- teacher solution:
def chief_string(title):
    if 'chief' in title.lower().split():
        return True
    else:
        return False
sum(sal['JobTitle'].apply(lambda x: chief_string(x)))
- Bonus: Is there a correlation between length of the Job Title string and Salary?
# we make a new column for title length
# we use method corr() for correlation
sal['title_len'] = sal['JobTitle'].apply(len)
sal[['TotalPayBenefits','title_len']].corr()
- What is the average Purchase Price? `ecom['Purchase Price'].mean()`
- What were the highest and lowest purchase prices?
ecom['Purchase Price'].max()
ecom['Purchase Price'].min()
- How many people have English 'en' as their Language of choice on the website? `ecom[ecom['Language'] == 'en'].shape[0]` or `ecom[ecom['Language'] == 'en']['Language'].count()`
- How many people have the job title of "Lawyer"? `ecom[ecom['Job'] == 'Lawyer'].shape[0]`
- How many people made the purchase during the AM and how many during the PM? `ecom['AM or PM'].value_counts()`
- What are the 5 most common Job Titles? `ecom['Job'].value_counts().head(5)`
- Someone made a purchase that came from Lot "90 WT", what was the Purchase Price for this transaction? `ecom[ecom['Lot'] == '90 WT']['Purchase Price']`
- What is the email of the person with the following Credit Card Number: 4926535242672853? `ecom[ecom['Credit Card'] == 4926535242672853]['Email']`
- How many people have American Express as their Credit Card Provider and made a purchase above $95? `ecom[(ecom['CC Provider'] == 'American Express') & (ecom['Purchase Price'] > 95.0)].shape[0]`
- Hard: How many people have a credit card that expires in 2025? `sum(ecom['CC Exp Date'].apply(lambda x: x[3:]=='25'))`
- Hard: What are the top 5 most popular email providers/hosts (e.g. gmail.com, yahoo.com, etc...)? `ecom['Email'].str.split('@',1,expand=True)[1].value_counts().head(5)`, another solution: `ecom['Email'].apply(lambda email: email.split('@')[1]).value_counts().head(5)`
- to get column names: `df.columns`
Lecture 40 - Introduction to Matplotlib
- most popular plotting library for Python
- it gives control over every aspect of a figure
- designed to give Matlab plotting look and feel
- works well with pandas and numpy
- to install it: `conda install matplotlib` or `pip install matplotlib`
- on the project site, under examples, we can see examples of the plots with code
- we import matplotlib: `import matplotlib.pyplot as plt`
- to use it in jupyter we need to enter `%matplotlib inline`. this will allow us to see our plots as we write them in the notebook
- if we don't use jupyter we need to add `plt.show()` at the end of our code before executing it
- we will use numpy for our first example
import numpy as np
x = np.linspace(0,5,11)
y = x**2
- there are 2 ways to create plots, functional and object oriented
# Functional
plt.plot(x,y) # shows the plot
plt.show() # in jupyter this prints the plot
- we can add matlab-style arguments like color and style: `plt.plot(x,y,'r--')` for red and dashed
- if we want to add labels we use `plt.xlabel('my label')` or `plt.ylabel('my label')`
- for a plot title: `plt.title('My title')`
plt.subplot(1,2,1)
plt.plot(x,y,'r')
plt.subplot(1,2,2)
plt.plot(y,x,'b')
- we will now have a look at the OO method: we create figure objects and call methods on them
fig = plt.figure() # creates a figure object, like a canvas
axes = fig.add_axes([0.1,0.1,0.8,0.8])
- we add axes with a method, passing a list of 4 args: left, bottom, width and height, each a value from 0 to 1 (percentage of the canvas). left and bottom are coordinates, width and height are sizes
- we then plot on the axes to see the plot in our set of axes: `axes.plot(x,y)`. the plot is the same as before but in an OO approach
- we can put labels like before on the set of axes
axes.set_xlabel('Set X Label') # Notice the use of set_ to begin methods
axes.set_ylabel('Set y Label')
axes.set_title('Set Title')
- we will put 2 sets of figures on the same canvas using OO
fig = plt.figure()
axes1 = fig.add_axes([0.1,0.1,0.8,0.8]) # for first plot
axes2 = fig.add_axes([0.2,0.4,0.4,0.3]) # for second plot
- the plots are overlaid one over the other as the axes are overlapping
- we plot on the axes and the advantage of OO approach becomes apparent
axes1.plot(x,y)
axes2.plot(y,x)
- we will create subplots using OO: `fig, axes = plt.subplots()`
- by using `axes.plot(x,y)` we get a plot like before
- with subplots we can specify rows and columns: `fig, axes = plt.subplots(nrows=1,ncols=2)`. now we have two empty canvases to apply our plots to, as 2 columns
- so subplots manages the axes automatically. the fig, axes syntax is tuple unpacking
- axes is actually an array of two (1 per subplot). we can iterate through it or index it
for current_ax in axes:
current_ax.plot(x,y)
# OR INDEX EQUIVALENT
axes[0].plot(x,y)
axes[1].plot(x,y)
- each axes element is an object where we can call the methods we have seen before
- at the end of the plot statements (especially when we use subplots) we should use `plt.tight_layout()`
- we will now see figure size, aspect ratio and DPI. matplotlib allows control over them
- we can control figure size and dpi: `fig = plt.figure(figsize=(3,2),dpi=100)`. usually we don't set dpi in jupyter. the tuple we put in figsize is (width_inches, height_inches). we can also pass figsize to subplots
- to save a figure to a file we use savefig: `fig.savefig('my_picture.jpg',dpi=200)`. the filename extension defines the file format
- we can add legends to clarify the plot. we do it by adding `axes.legend()` as the last statement where the axes are defined
- for this to work we need to add a label argument to our plots: `axes.plot(x,y,label='X to the power of 2')`
- we can position the legend passing a param to legend(), e.g. `axes.legend(loc=0)` aka best position, or passing a tuple for the bottom-left position in % (0 to 1)
- setting colors in a plot: we pass them as a color argument as strings; they are accepted as color names or RGB hex, e.g. `ax.plot(x,y,color="green")` or `ax.plot(x,y,color="#FF00FF")`
- we can set the plot line width (in points) or line style: `ax.plot(x,y,linewidth=20)` or `ax.plot(x,y,lw=20)`. we can also control alpha (transparency): `ax.plot(x,y,alpha=0.5)`
- linestyle is again passed as a param: `linestyle="--"` or `ls="--"`, or ":", "steps", etc.
- markers are used when we have few datapoints. we set them as arguments in plot: `ax.plot(x,y,marker="*")` or any other marker we want. we can set the marker size with the plot argument `markersize=13` or any int
- we can customize markers further with `markerfacecolor="yellow", markeredgewidth=3, markeredgecolor="green"`
- we can limit the x and y range in our plots with `axes.set_xlim([0,1])`, passing lower and upper bound, or set_ylim, limiting the axis to actual values
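- a compact sketch of the OO workflow above (subplots, styling, legend, save); values and colors are arbitrary:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 5, 11)
y = x ** 2

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3))

axes[0].plot(x, y, color='green', lw=2, ls='--', label='X to the power of 2')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].set_title('squares')
axes[0].legend(loc=0)

axes[1].plot(y, x, color='#FF00FF', marker='*', markersize=10)
axes[1].set_xlim([0, 25])

plt.tight_layout()
fig.savefig('my_picture.png', dpi=200)
plt.show()  # not needed inside jupyter with %matplotlib inline
```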
- advanced topics are available as a bonus notebook
- all commands on a plot must be in one cell
Lecture 46 - Introduction to Seaborn
- seaborn is a statistical plotting library with beautiful default styles
- it works very well with pandas dataframe objects
- we install it with `conda install seaborn` or `pip install seaborn`
- there are many examples on the docs site
- we will analyze the distribution of a dataset
- we import seaborn: `import seaborn as sns`
- we set matplotlib as inline (seaborn uses it): `%matplotlib inline`
- seaborn comes with inbuilt datasets for testing: `tips = sns.load_dataset('tips')`. tips is one of them, a simple dataframe about restaurant tips
- for distplot we pass a single column of our dataframe; it shows a univariate distribution: `sns.distplot(tips['total_bill'])`. we get a histogram and a KDE (kernel density estimate, the probability density function of a random variable)
- we can remove the kde and keep only the histogram by setting it to False: `sns.distplot(tips['total_bill'],kde=False)`
- the histogram is a distribution chart which shows where our values lie in the min-max value range. the x range is split into areas or bins and the y axis is a count. we can change the number of bins: `sns.distplot(tips['total_bill'],kde=False,bins=30)`. if our bins value is too high we plot each value on its own
- jointplot allows us to match two distplots for bivariate data. we pass an x variable, a y variable and our dataset (dataframe); usually x and y are column labels: `sns.jointplot(x='total_bill',y='tip',data=tips)`
- what we get is 2 distplots, one on each axis, and a scatterplot to show the correlation: when the bill is higher the tip is higher as well
- in jointplot we can set an additional parameter called kind: `sns.jointplot(x='total_bill',y='tip',data=tips,kind='scatter')`. the default is scatter. we can use kind='hex' instead to get a hexagon distribution representation (more points, darker color). we can also set kind="reg", which gives a scatterplot with a regression line overlaid (linear regression), like a linear fit, or kind="kde" to get a 2D KDE plot showing the density
- pairplot shows pairwise relationships within an entire dataset (dataframe). it also supports a color hue param for categorical data. what pairplot does is essentially every possible combination of jointplots on a dataset's numerical values: `sns.pairplot(tips)`
- when plotting a column against itself it shows a histogram instead; the rest are scatterplots
- if we add the param hue="sex" we get a colored plot showing the 2 different sex categories in different colors. this is a great way to add non-numerical categorical data into the mix (passing their column label)
- we can choose a palette for hue: `sns.pairplot(tips,hue='sex',palette='coolwarm')`
- the next plot we look at is rugplot. in rugplot we pass a single column: `sns.rugplot(tips["total_bill"])`. rugplot plots 1 dash for every datapoint in the column, with the x axis being the value range; dashes stack on each other. the distplot counts the values, adding them to bins, while the rugplot shows a density-like representation of the datapoints
- so how do we build the kde plot from the rugplot? KDE stands for Kernel Density Estimation. from the rugplot datapoints, gaussian (normal) distributions are calculated; their sum is the KDE plot
- in an example we make a random dataset => make a rugplot of it => set the axis for the plot => use linspace to split the axis => plot a normal distribution for each of the rugplot points => sum them up to get the kde plot (see the sketch below)
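- a rough sketch of that construction, summing one gaussian per datapoint; the fixed bandwidth and the crude normalization are assumptions just to illustrate the idea:

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

dataset = np.random.randn(25)             # random datapoints
sns.rugplot(dataset)                      # one dash per datapoint

x_axis = np.linspace(dataset.min() - 2, dataset.max() + 2, 100)
bandwidth = 0.5                           # assumed fixed bandwidth for the sketch

kernels = [stats.norm(point, bandwidth).pdf(x_axis) for point in dataset]
kde = np.sum(kernels, axis=0)             # the KDE is the sum of the per-point gaussians
kde = kde / kde.sum()                     # crude normalization, for display only

plt.plot(x_axis, kde, color='indianred')
plt.show()
```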
- we again import seaborn and set matplotlib inline
- in these plots we are interested in the distribution of categorical data
- we start with barplot, where we aggregate categorical data based on some function (the default is the mean): `sns.barplot(x=,y=,data=tips)`. we again set the x axis, the y axis and the dataset. usually x is a categorical column label, e.g. "sex", and for y we choose a numeric column, e.g. "total_bill". what we get by default is 2 bars with the mean or average total bill value for male and female customers
- we can use another aggregate function instead of the default by specifying the estimator parameter: `sns.barplot(x='sex',y='total_bill',data=tips,estimator=np.std)` for standard deviation. we can use a custom function of our own
- a countplot in seaborn is a barplot where on the y axis we have the count of occurrences of each category in the dataset: `sns.countplot(x='sex',data=tips)`
- boxplot (or box-and-whisker plot) is used to show the distribution of categorical data. again we set the x, y and data params: `sns.boxplot(x="day", y="total_bill", data=tips)`. a boxplot shows boxes for the quartiles of the distribution. the whiskers show the rest of the distribution, while the dots outside are outliers. if we add a hue parameter passing another categorical column label, our boxplot is split, showing boxplots for each category of hue. hue is excellent for adding insight to the analysis
- violinplot also shows the distribution of categorical data; arguments are exactly like boxplot: `sns.violinplot(x="day", y="total_bill", data=tips)`. violinplot plots all datapoints in the dataset as a continuous diagram showing the kde of the underlying distribution. it is harder to read. violinplot also accepts the hue param; one cool effect is that with hue and split=True we get both hues on one violin, one per side (as the original is symmetrical on the y axis): `sns.violinplot(x="day", y="total_bill", data=tips, hue='sex', split=True)`
- a stripplot is a scatterplot of categorical data. the basic params are the same as the others: `sns.stripplot(x="day", y="total_bill", data=tips)`. with this plot we cannot tell how many points stack on each other; we can use jitter=True, which makes the line thicker to show all points. again hue (and split) is supported
- the idea of stripplot and violinplot is combined in swarmplot. it is a stripplot where points are stacked vertically so that they don't overlap, giving the violin shape. params are the same: `sns.swarmplot(x="day", y="total_bill", data=tips)`
- swarmplots do not scale well for large numbers (they get too wide)
- we can stack a swarmplot on a violinplot to show where the kdes came from
- factorplot takes x and y arguments, a dataset and the kind of plot we want, e.g. kind="bar" for a barplot: `sns.factorplot(x='sex',y='total_bill',data=tips,kind='bar')`
- we load 2 built-in datasets from seaborn: tips and flights
flights = sns.load_dataset('flights')
tips = sns.load_dataset('tips')
- we will look into the heatmap plot. in order to work properly our data should be in matrix form (index name and column name match up) so that the cell value shows something relevant to both names
- in the tips dataset our columns are labeled but the rows are not. to get it into matrix form our index should have labels relevant to the cell values. this is done by applying a pivot or a correlation transformation to the dataset: `tc = tips.corr()`. now both rows and columns are labeled
- with data in matrix form we just have to call `sns.heatmap(tc)`. heatmap colors the values based on a gradient scale. we can set the scale with the cmap param, e.g. cmap="coolwarm", and we can annotate the plot with annot=True, which overlays the values on the tiles
- we want to transform the flights dataset so that the index is the month, the columns the year and the data values the passengers. we use pivot_table: `flights.pivot_table(index="month",columns='year',values="passengers")`. we get it in matrix form so we can show it in a heatmap plot. we can also set linecolor and linewidth params to separate the cells in the heatmap
- we also have the clustermap matrix plot, which is also applied on matrix datasets: `sns.clustermap(tc)`
- clustermap builds hierarchical clusters, clustering rows and columns together based on their similarity. this breaks the original order of the data but still contains valid representations. clustermap searches for similarities
- we can change the representation by normalizing the data, changing the scale: `sns.clustermap(pvflights,cmap='coolwarm',standard_scale=1)`, adding a standard_scale param (1 means a 0-1 scale)
- we again import seaborn and set matplotlib inline
- we use the inbuilt iris dataset: `iris = sns.load_dataset('iris')`
- it contains measurements of different flowers, and the species column has categorical data with 3 distinct values
- we plot the pairplot with `sns.pairplot(iris)` (the grid of scatterplots of numerical values seen before)
- another plot is PairGrid. with `g = sns.PairGrid(iris)` we print only the empty grid
- pairplot is a simplified version of PairGrid. PairGrid needs tweaking but gives more flexibility. to map a scatterplot on the pairgrid we need to import matplotlib, set the grid (passing the dataset) and then map the scatter onto it
import matplotlib.pyplot as plt
g = sns.PairGrid(iris)
g.map(plt.scatter)
- this produces a grid of scatterplots. it is like pairplot, only it is missing the histograms along the diagonal (it has scatterplots there)
- with PairGrid we have control over the plots we show on the grid
g.map_diag(sns.distplot) # distplot (histogram) on the diagonal
g.map_upper(plt.scatter) # scatterplot on the upper half of the grid
g.map_lower(sns.kdeplot) # kde plot on the lower half
- for the FacetGrid plot we will use the tips dataset: `tips = sns.load_dataset('tips')`
- for FacetGrid we specify the data, the column and the row, like subplots in matplotlib: `g = sns.FacetGrid(data=tips,col='time',row='smoker')`. the expression produces an empty grid (2 by 2)
- on the grid we do `g.map(sns.distplot,'total_bill')`
- this shows a distplot on each grid cell with the distribution of total_bill for the 4 cases defined by the combinations of the 2 categories of the dataset columns we chose (time and smoker)
- the data we pass is plot-type dependent, so for scatter we need two columns
g = g.map(plt.scatter, "total_bill", "tip")
- we import seaborn, set matplotlib inline and load the tips inbuilt dataset
- we use lmplot for linear regression. we set the features we want on the x axis and the y axis: `sns.lmplot(x='total_bill',y='tip',data=tips)`. what we get is a scatterplot with a linear fit on top. we can specify hue based on sex (we get 2 scatterplots and 2 linear fits on top of each other). we can even set different markers for our hue categories using the parameter markers=['o','v']; to control the marker size we use scatter_kws={'s':100}
- seaborn uses matplotlib under the hood, and with scatter_kws we control the matplotlib scatterplot by passing a dictionary; s is the size (see documentation)
- instead of separating categories by hue we can use a grid: `sns.lmplot(x='total_bill',y='tip',data=tips,col='sex')`, passing the col parameter. we can add a row param for another category (like FacetGrid but simpler)
- we can combine grid and hue
- size and aspect ratio are adjustable in seaborn with the aspect=0.6, size=8 params
- we import seaborn, set matplotlib inline and load the tips dataset from seaborn
- we do a simple plot: `sns.countplot(x='sex',data=tips)`
- seaborn has a set_style method to set the style for all our plots: `sns.set_style('white')`. darkgrid, whitegrid, ticks etc. are acceptable values
- we can remove spines: `sns.despine()` removes the top and right; we can remove bottom and left by specifying it: `sns.despine(left=True)`
- we can control the size and aspect ratio by specifying it
- in a non-grid plot we can spec it in the matplotlib figure like we have seen: `plt.figure(figsize=(12,3))` before our actual plot
- in grid plots we do it in seaborn: `sns.lmplot(x='total_bill',y='tip',size=2,aspect=4,data=tips)`
- we can set the context with `sns.set_context('poster',font_scale=4)`. with poster we get a much larger size, suitable for a poster; the default is notebook
- coloring can be controlled with the palette parameter (matplotlib colormap names are accepted values)
- we work with the titanic dataset
- we import numpy and pandas in our notebook and set matplotlib as inline
- we upload 2 csv files using pandas
df1 = pd.read_csv('df1',index_col=0) # remove index column (the imported dataframe has as index dates)
df2 = pd.read_csv('df2')
- say we want a histogram of all the values in a column of df1. pandas can do it with `df1['A'].hist()`. it calls matplotlib under the hood. like seaborn, we can set the number of bins in the histogram with the bins= param
- if we don't like the style we can import seaborn in our notebook and the pandas plots get styled like seaborn
- pandas has several inbuilt plots. all of them are called as methods on the dataframe
- we can use the general plot method, specifying the kind of plot plus params: `df1['A'].plot(kind='hist',bins=30)`
- another way is to chain the plot type as a method: `df1['A'].plot.hist(bins=30)`
- we can have an area plot of multiple numeric columns in a dataframe, even the entire dataframe: `df2.plot.area()`. we add transparency by specifying alpha as a param (0-1 value), e.g. alpha=0.4
- we can also have a barplot: `df2.plot.bar()` of all rows (indexes). we can have a stacked barplot by passing the parameter stacked=True
- histograms are the most used pandas built-in plot
- a lineplot is also available; we need to specify the x and y axis for this: `df1.plot.line(x=df1.index,y='B')`. we can use matplotlib arguments to control the appearance, e.g. figsize=(12,3), lw, markers etc.
- we can do scatterplots with pandas: `df1.plot.scatter(x='A',y='B')`. we can make the color depend on another column with c='C', and we can set cmap as well like in seaborn. instead of color we can make the dot size depend on a column's values with `s=df1['C']*200`, passing a factor to scale the size
- we can do a boxplot of a dataframe with `df2.plot.box()`
- we can do hexbin plots, passing x and y columns. we use a big dataset for this; we can control the hexagon size with the gridsize param
df = pd.DataFrame(np.random.randn(1000, 2), columns=['a', 'b'])
df.plot.hexbin(x='a',y='b',gridsize=25,cmap='Oranges')
- we can also do kde plots with `df['a'].plot.kde()` or density plots `df2.plot.density()`, which is the same thing
- we can limit the amount of index entries we feed into a plot with ix[:max]: `df3.ix[:30].plot.area()`
- we can print the legend outside of the plot with `plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))`
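- a short sketch of the pandas built-in plots above on random data (column names and values are arbitrary):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(np.random.randn(100, 4), columns=['A', 'B', 'C', 'D'])

df['A'].plot.hist(bins=30)                  # histogram of one column

ax = df.plot.area(alpha=0.4)                # area plot with transparency
ax.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))  # legend outside the axes

df.plot.scatter(x='A', y='B', c='C', cmap='coolwarm')    # color from a third column
df.plot.box()
plt.show()
```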
- plotly and cufflinks allow us to create interactive visualizations
- Plotly is an interactive visualization library
- Cufflinks connects plotly with pandas
- we need to install the libs: `pip install plotly` and `pip install cufflinks` (cufflinks is not available in conda)
- we create a virtual env: `conda create -n plotly plotly`
- we activate it: `source activate plotly`
- we install cufflinks in it: `pip install cufflinks`
- we install numpy: `conda install -n plotly numpy`
- we install pandas: `conda install -n plotly pandas`
- we install matplotlib: `conda install -n plotly matplotlib`
- we install seaborn: `conda install -n plotly seaborn`
- we run `anaconda-navigator` from it
- we select our new environment in navigator
- we install notebook in it
- we import numpy and pandas and set matplotlib inline
- we check the plotly version
from plotly import __version__
print(__version__)
- we need a version > 1.9
- we import cufflinks: `import cufflinks as cf`
- we import the plotly libs: `from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot`
- `init_notebook_mode(connected=True)` connects javascript to our notebook, as plotly connects pandas and python to an interactive JS library
- `cf.go_offline()` tells cufflinks to work offline
- we get some data for our plots
- `df = pd.DataFrame(np.random.randn(100,4),columns='A B C D'.split())`
- we create a second dataframe using a dictionary: `df2 = pd.DataFrame({'Category':['A','B','C'],'Values':[32,43,50]})`
- if we write `df.plot()` pandas plots a matplotlib plot (a line plot with a hue for each column)
- if instead we use iplot, `df.iplot()`, we get the same plot but from plotly, and it is interactive: we can zoom, pan, use tools
- we can use iplot for a scatterplot, specifying kind="scatter" and defining the 2 axes by column name: `df.iplot(kind="scatter",x="A",y="B")`. this comes out as a line scatterplot; we fix that by adding the param mode="markers". we can also affect the size of the marks with size=
- we do a barplot on df2 where x is a categorical column (Category) and y a numerical column (Values): `df2.iplot(kind="bar", x="Category",y="Values")`
- we can use groupby or perform aggregate functions on our dataframe to bring it to a form where we can plot it: `df.sum().iplot(kind="bar")`
- for boxplots: `df.iplot(kind="box")`
- we can do a 3D surface plot using a new dataframe: `df3 = pd.DataFrame({'x':[1,2,3,4,5],'y':[10,20,30,40,50],'z':[100,200,300,400,500]})`
- we plot the surface plot with `df3.iplot(kind="surface")`. we can set a colorscale with colorscale="rdylbu"
- we can do histogram plots: `df['A'].iplot(kind='hist',bins=345)`. if we pass the whole dataframe we get overlapping histograms of all columns
- we can use a spread-type visualization (for stock analysis): `df[['A','B']].iplot(kind="spread")`. we get a linechart and a spread (area plot) beneath
- we can use a bubbleplot (like a scatterplot but where the size of the marks shows a value), so we specify the column for size: `df.iplot(kind="bubble",x="A",y="B",size="C")`. popular in UN reports
- scatter_matrix is like a seaborn pairplot; it is applied on a dataframe: `df.scatter_matrix()`
- cufflinks can do technical analysis plots for finance (beta)
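- a sketch of the cufflinks calls above; it assumes plotly and cufflinks are installed and it runs inside a jupyter notebook:

```python
import numpy as np
import pandas as pd
import cufflinks as cf
from plotly.offline import init_notebook_mode, iplot

init_notebook_mode(connected=True)   # hook plotly's JS into the notebook
cf.go_offline()                      # let cufflinks work without a plotly account

df = pd.DataFrame(np.random.randn(100, 4), columns='A B C D'.split())
df2 = pd.DataFrame({'Category': ['A', 'B', 'C'], 'Values': [32, 43, 50]})

df.iplot()                                             # interactive line plot
df.iplot(kind='scatter', x='A', y='B', mode='markers')
df2.iplot(kind='bar', x='Category', y='Values')
df.iplot(kind='box')
df[['A', 'B']].iplot(kind='spread')
```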
- geo plotting is difficult due to the various formats the data can come in
- we will focus on using plotly for plotting
- matplotlib has also a basemap extension
- usually the dataset determines the library we will use
- we should keep the notebook of the course as a reference, as choropleths are difficult to master
- we import plotly: `import plotly.plotly as py`
- we import the plotly libs: `from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot`
- we set notebook mode: `init_notebook_mode(connected=True)`
- we start by setting our configuration object as a dictionary: `data = dict(type='choropleth',locations=['AZ','CA','NY'], locationmode='USA-states',colorscale="Portland",text=['text 1','text 2','text 3'], z=[1.0,2.0,3.0],colorbar={'title':'Colorbar Title'})`. text is the text shown when we hover over an area, z are the actual values, colorscale is the colorscale we use, colorbar sets config for the colorscale legend (title), locations are a way to identify the areas on the choropleth (standardized) and locationmode is the type of map
- next we create the layout as a dictionary passing the map scope: `layout = dict(geo = {'scope':'usa'})`
- we need to import the graph objects library from plotly to plot our choropleth: `import plotly.graph_objs as go`
- we use the Figure method (constructor) from go to instantiate our choropleth: `choromap = go.Figure(data=[data],layout=layout)`. note we pass the data dictionary in an array
- we plot with `iplot(choromap)`; we can plot a non-interactive map for printing with `plot(choromap)`
- for a choropleth we plot a Figure that uses two objects, data and layout, both dictionaries
- we will use real data this time: `df = pd.read_csv('2011_US_AGRI_Exports')`
- we set our data dict using columns from the dataframe
- we also add a new argument: a marker as a nested dictionary, setting its line as a nested dictionary
data = dict(type='choropleth',
colorscale = 'YIOrRd',
locations = df['code'],
z = df['total exports'],
locationmode = 'USA-states',
text = df['text'],
marker = dict(line = dict(color = 'rgb(255,255,255)',width = 2)),
colorbar = {'title':"Millions USD"}
)
- we set the layout passing a title; we also set the lakes
layout = dict(title = '2011 US Agriculture Exports by State',
geo = dict(scope='usa',
showlakes = True,
lakecolor = 'rgb(85,173,240)')
)
- we then instantiate the figure and plot it
- the marker sets a line between states
- we load a new dataset: `df = pd.read_csv('2014_World_GDP')`
- we set the data config dictionary
data = dict(
type = 'choropleth',
locations = df['CODE'],
z = df['GDP (BILLIONS)'],
text = df['COUNTRY'],
colorbar = {'title' : 'GDP Billions US'},
)
- there is no locationmode here (only countries), identified using the international CODE. we again use columns from df
- we set the layout setting the geo in a new way
layout = dict(
title = '2014 Global GDP',
geo = dict(
showframe = False,
projection = {'type':'Mercator'}
)
)
- we set the figure and plot
- see docs for more
- with choropleth we need our dataset to contain the country codes or state codes it understands
- we can set `locationmode="country names"` to use full country names
- we can set reversescale=True to reverse the color scale
- convert a column from string to timestamp: `df['timeStamp'] = pd.to_datetime(df['timeStamp'])`
- What are the top 5 zipcodes for 911 calls? `df['zip'].value_counts().head(5)`
- What are the top 5 townships (twp) for 911 calls? `df['twp'].value_counts().head(5)`
- Take a look at the 'title' column, how many unique title codes are there? `df['title'].unique().shape[0]`
- In the titles column there are "Reasons/Departments" specified before the title code. These are EMS, Fire, and Traffic. Use .apply() with a custom lambda expression to create a new column called "Reason" that contains this string value: `df['Reason'] = df['title'].apply(lambda x: x.split(':')[0])`
- What is the most common Reason for a 911 call based off of this new column? `df['Reason'].value_counts()`
- Now use seaborn to create a countplot of 911 calls by Reason: `sns.countplot(x='Reason',data=df)`
- You should have seen that these timestamps are still strings. Use pd.to_datetime to convert the column from strings to DateTime objects: `df['timeStamp'] = pd.to_datetime(df['timeStamp'])`
- You can use Jupyter's tab method to explore the various attributes you can call. Now that the timestamp column contains actual DateTime objects, use .apply() to create 3 new columns called Hour, Month, and Day of Week. You will create these columns based off of the timeStamp column; reference the solutions if you get stuck on this step.
df['Hour'] = df['timeStamp'].apply(lambda x: x.hour)
df['Month'] = df['timeStamp'].apply(lambda x: x.month)
df['Day of Week'] = df['timeStamp'].apply(lambda x: x.dayofweek)
- Notice how the Day of Week is an integer 0-6. Use the .map() with this dictionary to map the actual string names to the day of the week:
dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
df['Day of Week'] = df['Day of Week'].apply(lambda x: dmap[x])
- Now use seaborn to create a countplot of the Day of Week column with the hue based off of the Reason column.
sns.countplot(x='Day of Week',data=df,hue='Reason')
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
- Now do the same for Month:
sns.countplot(x='Month',data=df,hue='Reason')
plt.legend(loc='center left', bbox_to_anchor=(1.0, 0.5))
- Now create a groupby object called byMonth, where you group the DataFrame by the month column and use the count() method for aggregation. Use the head() method on this returned DataFrame.
byMonth = df.groupby('Month').count()
byMonth.reset_index(level=0, inplace=True)
byMonth.head()
- Now create a simple plot off of the dataframe indicating the count of calls per month.
plt.xlabel('Month')
plt.ylabel('Num of Calls')
plt.title('Number of 911 calls per Month')
plt.grid(True)
plt.plot(byMonth['Month'],byMonth['e'].values)
- Now see if you can use seaborn's lmplot() to create a linear fit on the number of calls per month. Keep in mind you may need to reset the index to a column.
sns.lmplot(x='Month',y='e',data=byMonth)
- Create a new column called 'Date' that contains the date from the timeStamp column. You'll need to use apply along with the .date() method: `df['Date'] = df['timeStamp'].apply(lambda x: x.date())`
- Now groupby this Date column with the count() aggregate and create a plot of counts of 911 calls.
byDate = df.groupby('Date').count()
plt.figure(figsize=(12,3))
plt.xlabel('Date')
plt.ylabel('Num of Calls')
plt.title('Total Calls per Day')
plt.grid(True)
plt.plot(byDate.index,byDate['e'])
- Now recreate this plot but create 3 separate plots with each plot representing a Reason for the 911 call
traffic = df[df['Reason']=='Traffic'].groupby('Date').count()
plt.figure(figsize=(12,3))
plt.xlabel('Date')
plt.ylabel('Num of Calls')
plt.title('Traffic')
plt.grid(True)
plt.plot(traffic.index, traffic['e'])
- Now let's move on to creating heatmaps with seaborn and our data. We'll first need to restructure the dataframe so that the columns become the Hours and the Index becomes the Day of the Week. There are lots of ways to do this, but I would recommend trying to combine groupby with an unstack method. Reference the solutions if you get stuck on this!
df1 = df.copy()
df1.set_index(['Day of Week','Hour'], inplace=True)
dfh = df1.groupby(level=['Day of Week','Hour']).sum()
pivot = dfh.pivot_table(values='e',index=dfh.index.get_level_values('Day of Week'),columns=dfh.index.get_level_values('Hour'))
pivot
- Now create a HeatMap using this new DataFrame.
plt.figure(figsize=(12,6))
sns.heatmap(pivot,cmap="viridis")
- Now create a clustermap using this DataFrame.
sns.clustermap(pivot,cmap="viridis")
- we need to install pandas-datareader to fetch the data: `pip install pandas-datareader`
- we can download the data from the provided link and import it: `df = pd.read_pickle('all_banks')`
- how to do remote data access with pandas using pandas-datareader [old version]: http://pandas-datareader.readthedocs.io/en/latest/remote_data.html
- we use the datareader to import data for each major bank using the stock tickers (3-letter codes)
- how to set column name levels: `bank_stocks.columns.names = ['Bank Ticker','Stock Info']`
- to select in a multi-level index we use the .xs() cross-section
- we spec the column, the level where it is, and the axis we work on: `bank_stocks.xs(key='Close',level='Stock Info',axis=1).max()`
- pairplotting with nulls throws an error; we can drop the first row: `sns.pairplot(returns[1:])`
- how to get the index with the min value? `.idxmin()`
- how to select rows between string indexes? use an array slice: `returns.ix['2015-01-01':'2015-12-31']`
- how to plot selecting from a multiindex, using a for loop and pandas plots
for tick in tickers:
bank_stocks[tick]['Close'].plot(label=tick,figsize=(12,4))
plt.legend()
- using xs
bank_stocks.xs(key='Close',axis=1,level='Stock Info').plot(label=tick,figsize=(12,4))
plt.legend()
- using plotly for interaction
bank_stocks.xs(key='Close',axis=1,level='Stock Info').iplot()
- pandas has the .rolling() method, which takes a window size param to apply functions on a rolling window of the data
BAC['Close'].ix['2008-01-01':'2009-01-01'].rolling(window=30).mean()
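- a self-contained sketch of the rolling-mean idea using a synthetic price series (the course uses the real BAC 'Close' column; in newer pandas, .loc replaces the deprecated .ix for label slicing):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# synthetic stand-in for a 'Close' price series with a DatetimeIndex
idx = pd.date_range('2008-01-01', '2009-01-01', freq='D')
close = pd.Series(np.random.randn(len(idx)).cumsum() + 40, index=idx)

close.plot(label='Close')
close.rolling(window=30).mean().plot(label='30-day rolling mean')
plt.legend()
plt.show()
```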
- we will use the ISLR book from Gareth James as a reading assignment
- Statistical Learning is another name for Machine Learning
- this book will be used as a mathematical reference. the code in the book is in R
- The book is a reference if we want to dive into the mathematics behind the theory
- Reading is optional
- machine learning is a method of data analysis that automates analytical model building
- using algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed where to look
- Machine Learning is used in:
- Fraud detection
- Web search results
- Real-time ads on web-pages
- Credit scoring and next-best offers
- Prediction of equipment failures
- New pricing models
- Network intrusion detection
- Recommendation Engines
- Customer Segmentation
- Pattern and Image Recognition
- Financial Modeling
-
The Machine Learning Process is the Following: Data Acquisition -> Data Cleaning -> [ Model Training & Building <-> Model Testing] || [Test Data -> Model Testing] -> Model Deployment
-
To clean the data we can use Pandas.
-
Clean Data are Split into Training Data and Test Data
-
we train our machine learing model on the training data
-
we test our model on the test data
-
we tune our model until testing on test data gives good results
-
then we deploy our model
-
There are 3 main types of Machine Learning algorithms
- Supervised Learning: we have labeled data and try to predict a label based on known features
- Unsupervised Learning: We have unlabeled data and we are trying to group together similar data points based off of features
- Reinforcement Learning: Algorithm learn to perform an action from experience
-
Supervised learning:
- algorithms are trained using labeled examples, like an input where the desired output is known e.g a piece of equipment could have datapoints labeled either F (failed) or R (runs).
- The learning algorithm receives a set of inputs along with the corresponding correct outputs.
- The algorithm learns by comparing its actual output with correct outputs to find errors. it then modifies the model accordingly
- through methods like classification, regression, prediction and gradient boosting, supervised learning uses patterns to predict the values of the label on additional unlabled data
- Supervised learning is commonly used in applications where historical data predicts likely future events
- e.g it can anticipate when credit card transactions are likely to be fraudulent or which isurance customer is likely to file a claim
- it can predict the price of a house based on different features for houses for which we have historical price data
-
Unsupervised Learning:
- is used against data that has no historical labels
- the system is not told the 'right answer' The algorithm must figure out what is being shown
- the goal is to explore the data and find some structure within it
- or it can find the main attributes that separate customer segments from each other
- popular techniques include: self-organizing maps, nearest-neighbour mapping, k-means clustering and singular value decomposition
- these algotruthms are also used to segment text topics recommend items and identify data outliers
-
Reinforcement Learnign:
- is often used for robotics, gaming and navigation
- with reinforcemnt learing, the algorithm discovers through trial and error which actions yield the greatest rewards
- this type of learning has three primary components: the agent(the learner or decision maker), the environment (everything the agent interacts with) and actions (what the agent can do)
- the objective is for the agent to choose actions that maximize the expected reward over a given amountof time
- the agent will reach the goal much faster by following a good policy
- the goal in reinforced learning is to learn the best policy
-
Each Algorithm or ML topic in this course includes:
- A reading assignment
- light overview of theory
- demonstration lecture using python
- ML project assignment
- Overview of Solution for Project
-
Disclaimer: Machine Learning is Hard to Learn: Take your Time, Be patient
- We will use the scikit-learn package. it's the most popular machine learning package for Python and has a lot of algorithms built-in
- we need to install it. using
conda install scikit-learn or pip install scikit-learn - we will install it in our plotly env
conda install -n plotly scikit-learn - in scikit-learn every algorithm is exposed via an Estimator
- first we import the Model:
from sklearn.family import Model, for example for Linear Regression: from sklearn.linear_model import LinearRegression - the second step is to instantiate the Model
- Estimator parameters: all the parameters of an estimator can be set when it is instantiated, and have suitable default values. we can see the possible params in Jupyter with shift+tab
- for example we can instantiate the Linear Regression estimator with:
model = LinearRegression(normalize=True) setting the normalize param to True. if we print the instance print(model) we see all the default params of the model instance LinearRegression(copy_X=True, fit_intercept=True, normalize=True) - once we have our model created with the desired params, we can fit it on some data
- but this data must be split in two sets (training set and test set)
- we import numpy for datasets
import numpy as np - we import train test split from scikit
from sklearn.cross_validation import train_test_split (in newer scikit-learn this lives in sklearn.model_selection) - we then create our dataset and a set of labels
X, y = np.arange(10).reshape((5,2)), range(5) - we split our data
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3) which splits our data in a 70/30 train/test ratio and randomizes the order - with the data split, we can train/fit our model on the training data
- this is done with the model.fit() method
model.fit(X_train,y_train) - now the model has been fit and trained on the training data
- the model is ready to predict labels or values on the test set!
- the process we follow is a SUPERVISED LEARNING process; in unsupervised learning we don't have the labels
- we get the predictions with
predictions = model.predict(X_test) - we can evaluate our model by comparing our predictions to the correct values (y_test)
- the evaluation method depends on the kind of machine learning algorithm we use (e.g Regression, Classification, Clustering etc)
- On all Estimators the model.fit() method is available. for supervised learning applications it accepts 2 arguments, data X and labels y e.g model.fit(X,y)
- For unsupervised learning apps, this accepts only a single argument, the data X
- In all Supervised Estimators we have a model.predict() method: given a trained model, predict the label of a new set of data. this method accepts one argument, the test data X_test, and returns a learned label for each object in the array
- for some supervised estimators the model.predict_proba() method is available. it is used for classification problems and returns the probability that a new observation has each categorical label. in this case the label with the highest probability is returned by model.predict()
- model.score() is offered for classification or regression problems, as most estimators implement a score method. scores are between 0 and 1. a larger score indicates a better fit
- model.predict() is also available in unsupervised estimators to predict labels in clustering algorithms
- model.transform() is available in unsupervised estimators: given an unsupervised (trained) model, it transforms new data (test data) into the new basis. it accepts X_new and returns the new representation of the data based on the unsupervised model
- some unsupervised estimators implement the model.fit_transform() method which more efficiently performs a fit and a transform on the same input data
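- a minimal end-to-end sketch of this generic estimator interface, reusing the toy dataset from above (LinearRegression is just an example estimator here):
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
X, y = np.arange(10).reshape((5, 2)), range(5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = LinearRegression()
model.fit(X_train, y_train)           # supervised fit(X, y)
predictions = model.predict(X_test)   # predicted values for the unseen test data
print(model.score(X_test, y_test))    # score method (R^2 for regression estimators)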
- Linear Regression was developed in the 19th century by Francis Galton.
- He was investigating the relationship between heights of fathers and sons
- He discovered that a man's son tended to be roughly as tall as his father.
- But he found that a son's height tended to be closer to the overall average height of all people
- This phenomenon was called regression, as a son's height regresses (drifts) towards the mean average height
- all we are trying to do when we calculate the regression line is to draw a line that is as close to every dot in the dataset as possible.
- for classic linear regression (Least Squares Method) we measure the closeness in the 'up and down' direction
- we apply the same concept to multiple points
- The goal in linear regression is to minimize the vertical distance between all the data points and our line.
- So to determine the best line we attempt to minimize the distance between all the points and their distance to the line
- there are many ways to do it (sum of squared errors, sum of absolute errors etc) but all these methods have a general goal of minimizing the distance
- the most common method is the least squares method (minimize the sum of the squared residuals)
- the residual for an observation is the difference between the observation (y-value) and the fitted line
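- written out, the least squares fit picks the intercept and slope that minimize the sum of squared residuals; in LaTeX form:
\min_{b_0, b_1} \sum_{i=1}^{n} \left( y_i - (b_0 + b_1 x_i) \right)^2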
- we import split as
from sklearn.model_selection import train_test_split - we will start off by working with a housing data set, trying to create a model to predict housing prices based off of existing features
- to ease our process we will work with artificially created datasets
- later on we will use real data sets from kaggle
- we import pandas, numpy pyplot from matplotlib, seaborn and set matplotlib inline
- we load our csv to a dataframe
df = pd.read_csv('USA_Housing.csv') - we have averaged value columns for the area the house is in: area income, age, rooms, bedrooms, population
- we also have the house's price and address
- we check our data with
df.info() - we can also check
df.describe()to get statistical info about the data - we can check our columns with
df.columns() - we should get a better insight on the data, so we pairplot the table
sns.pairplot(df). we get histograms on the columns and also correlation scatterplots. we see that bedrooms and rooms are segmented as they have integer values (represented as floats) - we want to predict the price of the house so we do a distribution plot
sns.distplot(df['Price']). we see that the kde peaks at 1.2M and distributes evenly around it - we also want an insight into the correlation between columns so we create a heatmap
sns.heatmap(df.corr()) we immediately see what affects the price. - we have sufficient insight to start building our linear regression model
- First we have to split our data into an X array containing the features to train on, and a Y array with the target variable (price). we will also delete the address column as we cannot process text. we do it manually setting the columns
df.columns
X = df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms',
'Avg. Area Number of Bedrooms', 'Area Population']]
y = df['Price']
- with features and target arrays ready, we now have to split our data into training and test sets. we import the module and do the split specifying the test size. we can see the train_test_split params with Shift+Tab. we use 40% for test data. random state is used to seed the split as it is done randomly. so selection of test and train data is random.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
- with data ready we need to train our model
- we first import the Estimator
from sklearn.linear_model import LinearRegression - then we instantiate our model (estimator)
lm = LinearRegression() - we check the available methods in our model (Shift+Tab). we choose to train our model passing the train data
lm.fit(X_train,y_train) - we evaluate our model by checking its coefficients and see how we can evaluate them
print(lm.intercept_) => a number; the coefficients lm.coef_ show the relation of the target to each feature in our data - to see the coefficient mapping better we create a dataframe.
pd.DataFrame(lm.coef_,X.columns,columns=['Coeff']) - What do these coefficients mean? if we hold all other features fixed, a 1-unit increase of a feature will change the target by the relevant coefficient amount. these are artificial data so they don't make much sense
- we can load real data from inbuilt sklearn datasets.
- boston is a dictionary with some keys
from sklearn.datasets import load_boston
boston = load_boston()
boston.keys()
print(boston['DESCR']) # description
print(boston['data']) # data
print(boston['feature_names']) # column labels
print(boston['target']) # target [pricing]
- we will now see how to get predictions from our model
- we get the predictions from the model
predictions = lm.predict(X_test) - predictions are the predicted values while y_test contains the target test data; we need to compare these two. - we want to see the correlation so we create a scatter plot
plt.scatter(y_test, predictions). if they line up linearly with relatively small jitter we know our prediction is good. a perfect straight line would mean perfectly correct predictions (unrealistic) - we will now do a distribution of the residuals (the difference between predictions and test values)
sns.distplot(y_test-predictions). they look normally distributed. this is a GOOD sign. it means we selected the correct model - there are 3 evaluation metrics for regression problems:
- mean absolute error (MAE) the mean of absolute value of errors: easiest to understand (average error)
- mean squared error (MSE) the mean of squared errors: more popular as it punishes larger errors (real world approach)
- root mean squared error (RMSE) the square root of the mean of the squared errors: more popular than MSE as it is interpretable in Y units (like MAE)
- all these are loss functions; our aim is to minimize them
- we can easily calculate them with sklearn. we import the module
from sklearn import metrics - from metrics we use the specialized functions
print('MAE:', metrics.mean_absolute_error(y_test, predictions))
print('MSE:', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))
- Bias Variance Trade Off is a fundamental topic of understanding our model's performance
- in Chapter 2 of ISL book there is an indepth analysis
- the bias-variance trade off is the point where we are adding just noise by adding model complexity (flexibility)
- the training error goes down as it has to, but the test error is starting to go up
- the model after the bias trade-off begins to overfit
- imagine that the center of the target is a model that perfectly predict the correct values
- as we move away from the bulls-eye our predictions get worse and worse
- sometimes we get a good distribution of training data so we predict well and we are close to the bulls-eye, while sometimes our training data might be full of outliers or non-standard values, resulting in poor predictions
- these different realizations result in a scatter of hits on the target (variance)
- we aim for a low variance-low bias model (consistent low error predictions)
- high bias-low variance means consistent but off-target predictions (high error)
- low bias-high variance means inconsistent but on-target (on average) predictions
- a common temptation for beginners in ML is to continually add complexity to a model until it fits the training set very well, like a polynomial fit curve matching a set of training points very closely
- doing this can cause the model to overfit (high variance) our training data and can cause large errors on new data like the test set
- we will take a look at an example model on how we can see overfitting occur from an error standpoint using test data!
- we will use a black curve with some "noise" points off of it to represent the True shape the data follows
- we have some data points. we start with the simplest linear fit, then quadratic, then a spline, raising the complexity to achieve the best fit
- we compare the MSE on test data and train data for these three cases; the test-train MSE difference is high for the linear fit, moderate for the quadratic, and for the spline the test MSE deviates upward while the training MSE keeps dropping
- so initially we have high bias-low variance (high error, similar train and test), in the middle a balance (low bias, low variance), and then high variance (big test-train deviation) with low bias (low MSE on training)
- we choose the middle point (the bias-variance trade-off); below this complexity we underfit and above it we overfit
- Sections 4-4.3 of ISL book
- we want to learn about Logistic Regression as a method for solving Classification problems
- Spam versus Good Emails
- Loan Default (Yes/No)
- Disease Diagnosis (Yes/No)
- All the above are examples of Binary Classification
- Classification can be Categorical
- In Linear Regression we saw regression problems where we try to predict a continuous value
- Although the name may be confusing, logistic regression allows us to solve classification problems, where we are trying to predict discrete categories
- the convention for binary classification is to have two classes 0 and 1
- we cannot use normal linear regression model on binary groups. it wont lead to a good fit
- say we want to predict the probability of defaulting on a loan. the feature is the salary. the higher the salary the higher the probability of paying back the loan; 1.0 is payback and 0.0 is default. our training data are binary, so either 1 or 0. with a linear fit we get a bad fit; we might even get values below 0, which make no sense.
- we can transform our linear regression line to a logistic regression curve (s type) bound between 0 and 1.
- this curve is the plot of the Sigmoid (aka Logistic) Function, which takes any value and outputs a number between 0 and 1: f(z) = 1 / (1 + e^(-z))
- The Sigmoid Function takes any input, so we can take our Linear Regression solution and place it into the Sigmoid Function. if our linear model is y = b0 + b1*x, our logistic model will be p = 1 / (1 + e^(-(b0 + b1*x)))
- this results in a probability from 0 to 1 of belonging to the 1 class. then we set a cutoff point, usually at 0.5: anything above it we consider class 1 (True) and anything below class 0 (False)
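- a minimal numeric sketch of the sigmoid and the 0.5 cutoff (b0, b1 and x are made-up numbers for illustration):
import numpy as np
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))
b0, b1 = -4.0, 0.05              # hypothetical intercept and slope from a fitted linear model
x = 120                          # a single feature value (e.g. a salary-like number)
p = sigmoid(b0 + b1 * x)         # probability of belonging to class 1 (~0.88 here)
label = 1 if p >= 0.5 else 0     # apply the 0.5 cutoff
print(p, label)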
- After we train our logistic regression model on some training data, we will evaluate our model's performance on some test data
- We can use a confusion matrix to evaluate classification models
- e.g if we test for a disease (Yes/No) we have a 2-by-2 matrix. the rows will be Actual NO, Actual YES and the columns Predicted NO, Predicted YES
- Predicted NO & Actual NO = True Negative (TN)
- Predicted YES & Actual YES = True Positive (TP)
- Predicted YES & Actual NO = False Positive (FP) Type I Error
- Predicted NO & Actual YES = False Negative (FN) Type II Error
- The first Metric is Accuracy (TP+TN)/total
- Misclassification Rate or Error Rate (how often is it wrong): (FP+FN)/total; Accuracy = 1 - Error Rate
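- a small worked example with made-up counts: suppose 165 test cases with TN=50, FP=10, FN=5, TP=100; then Accuracy = (100+50)/165 ≈ 0.91 and Error Rate = (10+5)/165 ≈ 0.09 = 1 - Accuracy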
- We will explore Logistic Regression using the famous Titanic data set to attempt to predict whether a passenger survived based off of their features
- Then we will have a project with some Advertising data trying to predict if a customer clicked on an ad
- titanic is a typical beginner's set for classification
- Kaggle hosts datasets for training
- Competitions offer challenges for some sort of prize
- we import all libs (numpy,pandas,matplotlib.pyplot,seaborn)
- we load training data
train = pd.read_csv('titanic_train.csv') in a pandas dataframe - we check its columns and see there is a survived column that is our target, passenger class, gender, age, sibsp (siblings/spouses aboard), parch (parents or children aboard), ticket id, ticket fare, cabin, port of embarkation
- we will start investigating our dataset (exploratory data analysis)
- we will look for missing data with
train.isnull() which outputs a boolean dataframe which we can feed into a heatmap to find missing data: sns.heatmap(train.isnull(),yticklabels=False,cbar=False,cmap='viridis'). we remove the tick labels and colorbar. the results are revelatory. ~20% of age data is missing and ~80% of cabin data. - 20% missing data is replaceable; at 80% we drop the column, or make it boolean (cabin known/not known)
- it's always good to see our target data in depth. we start with a simple countplot
sns.countplot(x='Survived',data=train) we add a hue based on sex to the countplot. females had a much higher survival rate than males (we expect to see this in the coefficients). we also add a hue based on Pclass. again it's revelatory. people from 3rd class had a much lower survival rate - we do a distplot on a numerical value, age, removing the nulls
sns.distplot(train['Age'].dropna(),kde=False,bins=30). we have a bimodal distribution: a number of small children, then a dip, then the peak at young adult ages. - we explore columns one by one trying to get an understanding of our data set. we go to SibSp and do a countplot as it is discrete (int) with few values.
sns.countplot(x='SibSp',data=train) - we look into fare. we do a pandas hist plot train['Fare'].hist(bins=40); fares lean heavily on the cheap side
- we can do an interactive cufflinks plot on fare
import cufflinks as cf
cf.go_offline()
train['Fare'].iplot(kind='hist',bins=30)
- After exploring our data, we will clean them, turn them in an acceptable form for a machine learning algorithm
- we will fill the missing age data with the mean average age of the dataset. this is a common practice. or we can do it more smartly by passing in the average age per passenger class as we guess there will be a correlation between class and age. to investigate it we do a boxplot
sns.boxplot(x='Pclass',y='Age',data=train). we see that passengers in 1st class tend to be older than those in 2nd or 3rd class - we will create a function to fill in the data (impute)
def impute_age(cols):
Age = cols[0]
Pclass = cols[1]
if pd.isnull(Age):
if Pclass == 1:
return 37
elif Pclass == 2:
return 29
else:
return 24
else:
return Age
- we will apply this function
train['Age'] = train[['Age','Pclass']].apply(impute_age, axis=1) - we do again the heatmap to confirm correct impute
- we decide to drop the Cabin column completely
train.drop('Cabin',axis=1,inplace=True) - with very few missing rows left we drop them from the dataframe
train.dropna(inplace=True) - our data are clean
- we will now turn categorical features into dummy data (e.g. male/female to 0/1, or the Embarked city first letter to numbers). we need to transform these to numbers as machine learning algorithms work with numbers
- we do it using the pandas get_dummies method
pd.get_dummies(train['Sex']) this produces 2 mutually exclusive columns. each is a perfect predictor of the other: if male is 1, female is 0. this issue is called multicollinearity. it messes up the algorithm, so we drop the female column with sex = pd.get_dummies(train['Sex'],drop_first=True) and keep the male column as a dataframe - we follow a similar approach for the Embarked column
embark = pd.get_dummies(train['Embarked'],drop_first=True) and we concat the tables train = pd.concat([train,sex,embark],axis=1) - we will ignore text based columns when we do our train/test data
train.drop(['Sex','Embarked','Name','Ticket'],axis=1,inplace=True) - we notice the passenger id is an index not useful for machine learning so we drop this.
train.drop(['PassengerId'],axis=1,inplace=True) - now our data is ready for machine learning algorithm
- the Pclass column is a categorical class column (1,2,3). we could do pd.get_dummies on that column (good for ML). we can do it later to see how the algorithm reacts to using dummy data versus keeping categorical data.
- in our folder we have two csv files, a training and a test file. we have been working with the train file so far and we will use it to get training and test data
- we should clean the test file and use it
- we make the X and y datasets
X = train.drop('Survived',axis=1)
y = train['Survived']
- we import test_train_split from scikit
from sklearn.model_selection import train_test_split and do the split in a 70/30 ratio: X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.30, random_state=101) - we import the Estimator Model
from sklearn.linear_model import LogisticRegression - we create an estimator model instance
logmodel = LogisticRegression() - we train the model passing the train data, leaving default params
logmodel.fit(X_train,y_train) - we generate our predictions passing the test data
predictions = logmodel.predict(X_test) - we are now ready to evaluate the results.
- scikit-learn has a handy classification report tool for classification model evaluation
- we import it
from sklearn.metrics import classification_report - we use it wrapping it with print and passing actual test values and predictions
print(classification_report(y_test,predictions)) - it returns some metrics for both classes: precision, recall, f1-score and support
- we can get the pure confusion matrix with
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,predictions)
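- a minimal sketch of computing the overall accuracy from these same predictions (the ~81% quoted below comes from this kind of calculation):
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, predictions))  # fraction of correctly predicted passengers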
- we have 81% accuracy. some ways of improving our score are to increase our training set (use both files); another is to grab features from the name (e.g. title) or the cabin number
- a good step in exploration is to make a pairplot of our dataset with hue on the target column
- Theory in ISL ch4.
- K Nearest Neighbors is a Classification algorithm that operates on a very simple principle
- we will use an example to show it
- imagine we had some imaginary data on dogs and horses, with heights and weights
- if we do a jointplot between weight and height of dogs and horses we will see the linear correlation. if we add a hue on the plot based on kind (horse/dog) we see that dogs are at the low end and horses at the high end. so from a weight/height datapoint it is quite easy to say with accuracy if it is a dog or a horse
- The KNN algorithm works as follows
- The training algorithm simply stores all data
- The prediction algorithm calculates the distance of a new data point x from all points in our data, sorts the points by ascending distance from x, calculates the majority label of the K nearest points and sets it as the prediction
- k is critical to the algorithm's behaviour as it will directly affect what class a new point is assigned to
- choosing k=1 picks up a lot of noise, choosing a large k creates a bias
- KNN Pros:
- Very Simple
- Training is Trivial
- Works in any number of Classes
- easy to add more data
- few params (K, Distance metric)
- KNN Cons:
- High Prediction Cost (increases as dataset gets larger)
- Not good with high dimensional data (distances to many dimensions)
- Categorical features dont work well
- A common interview task for a data scientist position is to be given anonymized data and attempt to classify it, without knowing the context of the data.
- We will simulate this scenario using some "classified" data, where what the columns represent is not known but we have to use KNN to classify it
- we import the libraries
- we read in the classified data csv into df
- it has random column names with numbers and a target class of 1 or 0
- the scale of the data plays a big role in distance, so it affects KNN. what we have to do as preprocessing is to standardize the variables, rescaling them to the same scale. sklearn has a tool for the task. we import it
from sklearn.preprocessing import StandardScaler we instantiate it scaler = StandardScaler() and train it or fit it to our data (only on the numeric columns, not the target column which is categorical/binary) scaler.fit(df.drop('TARGET CLASS', axis=1)) - then we use the trained scaler object to get the scaled features
scaled_features = scaler.transform(df.drop('TARGET CLASS', axis=1)) - scaled_features is actually an array of values. we use this array to recreate a features table which we can use in our algorithm
df_feat = pd.DataFrame(scaled_features, columns=df.columns[:-1])we set as column names the column names of the original table excluding the last one (targets) - our dataset is ready
- we import data splitter from sklearn and use it to split our data
X_train, X_test, y_train, y_test = train_test_split(scaled_features,df['TARGET CLASS'],test_size=0.30) - we are ready to use the algorithm. we import the Model
from sklearn.neighbors import KNeighborsClassifier - we instantiate it passing k=1 (num of neighbors)
knn = KNeighborsClassifier(n_neighbors=1) and fit it on our train data knn.fit(X_train,y_train) - we grab the predictions
pred = knn.predict(X_test) - we import and use classification_report and confusion_matrix
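- a minimal sketch of that evaluation step:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, pred))
print(classification_report(y_test, pred))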
- our results are good enough but we will use the elbow method to pick a good k value
- the elbow method essentially reruns the knn algorithm for different k values and stores the mean error, then we plot the error vs k and find the best k
error_rate = []
for i in range(1,40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test)) # average of where the predictions were not equal to the test values
- we plot the error rate to the k
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
- we look at the error rate plot. it stabilizes at a low value around k=20, so we rerun with k=17 and get our reports: accuracy and precision are better
Section 19 - Decision Trees and Random Forests
- chapter8 of ISL.
- We will see the rationale behind the Tree Methods by an example.
- Say we play tennis with a friend every Saturday. sometimes the friend shows up, sometimes not. for him, it depends on a variety of factors like: weather, temperature, humidity, wind etc
- We keep track of these features and whether he showed up to play or not, creating a dataset
- We want to use this data to predict whether or not he will show up to play.
- An easy way to do it is through a Decision Tree
- in this tree we have :
- nodes. nodes split for the value of a certain attribute
- edges: outcome of a split to next node
- root: the node that performs the first split
- leaves: terminal nodes that predict the outcome
- the Intuition behind Splits is simple.
- we start by building a truth table with all features as columns and the result being the target value. we write down all combinations and the outcome. in a simple 3-feature example with a boolean output we see that the output has a 1:1 relationship with Y, so the tree has 1 root node (Y) and two leaves (outputs). we say that Y gives perfect separation between classes
- other feats (X,Z) dont split classes perfectly. so if we split on these feats first we dont get good separation.
- Entropy and Information Gain are the mathematical methods of choosing the best split.
- Entropy: H(S) = -Σᵢ pᵢ(S) · log₂ pᵢ(S)
- Information Gain: IG(S,A) = H(S) - Σ_{u ∈ Values(A)} (|S_u| / |S|) · H(S_u)
- Aside from the mathematical foundation, the real intuition behind the splits is trying to choose the feature that best splits the data (trying to maximize the information gain of the split)
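- a tiny worked example with made-up counts: a node with 8 positive and 8 negative samples has H(S) = -(0.5·log₂0.5 + 0.5·log₂0.5) = 1 bit; a split that produces two pure children has a weighted child entropy of 0, so its information gain is 1 - 0 = 1, the best possible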
- Random Forest is a way to Improve performance of single decision trees.
- The main weakness of decision trees is that they dont have the best predictive accuracy
- This is mainly because of the high variance (different splits in the data can lead to very different trees)
- Bagging is a general purpose procedure for reducing the variance for a ML method
- We can build up the idea of Bagging by using Random Forests; Random Forest is a slight variation of this bag of trees that has even better performance.
- We will discuss it and code it in R (!?!)
- What we do in Random Forest is we create an ensemble of Trees using Bootstrap samples of the Training Set.
- Bootstrap samples of a training set means sampling from the dataset with replacement.
- However, when we are building each tree, each time a split is considered, a random sample of m features is chosen as split candidates from the full set of p features.
- The split is only allowed to use one of these m features
- A new random sample of features is chosen for every single tree at every single split
- For classification, this random sample of m features is typically chosen to be the square root of p, where p is the full set of features
- WHY USE RANDOM FORESTS???
- suppose there is one very strong feature in the dataset, strong in predicting a certain class. when using "bagged" trees (bootstrap sampling), most of the trees will use that feature as the top split, resulting in an ensemble of similar trees that are highly correlated. this is something we want to avoid
- averaging highly correlated quantities does not significantly reduce variance
- By randomly leaving out candidate features from each split, random forests "decorrelate" the trees, such that the averaging process can reduce the variance of the resulting model
- with that process, a single very strongly predictive feature cannot dominate every tree in the ensemble
- We will take a look at a small data set of Kyphosis patients and try to predict if a corrective spine surgery was successful.
- for our exercise project we will use loan data from Lending CLub to predict default rates
- BlogPost
- import libraries
- we load our dataset from a csv ('kyphosis.csv') which shows the result of the corrective operation on children. the features are: age in months, number of spinal disks operated on, and the index of the first spinal disk operated on. the target is a boolean (successful or unsuccessful)
- the data set is small (81 samples)
- we do exploratory analysis on the data
- we do a pairplot with the target as hue.
sns.pairplot(df,hue='Kyphosis') - we import the splitter method and split our dataset into train and test data and results (dropping the Kyphosis column from the features)
- we start by testing a single decision tree.
- we import the estimator
from sklearn.tree import DecisionTreeClassifier - we instantitate the model
dtree = DecisionTreeClassifier() - we train it with data using default params
dtree.fit(X_train,y_train) - we get the predictions
predictions = dtree.predict(X_test) - we evaluate the results: import and print classification_report and confusion_matrix. the results are rather good
- we will now use random forest:
- we import the estimator
from sklearn.ensemble import RandomForestClassifier - we instantiate it setting the number of estimators
rfc = RandomForestClassifier(n_estimators=200) - we train, predict and evaluate the model (see the sketch below). we have an improvement.
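- a minimal sketch of that train/predict/evaluate step (assuming the classification_report and confusion_matrix imports from the decision tree evaluation above):
rfc.fit(X_train, y_train)
rfc_pred = rfc.predict(X_test)
print(confusion_matrix(y_test, rfc_pred))
print(classification_report(y_test, rfc_pred))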
- as dataset gets larger random forest gets better results
- scikit has visualisation for decision trees. we wont use it often as we will use random forests which get better results.
- this plot comes in a lib called pydot that we have to install
pip install pydot - we also need the graphviz library. graphviz is a separate program altogether
sudo apt-get install graphviz
from IPython.display import Image
from sklearn.externals.six import StringIO
from sklearn.tree import export_graphviz
import pydot
features = list(df.columns[1:])
features
dot_data = StringIO()
export_graphviz(dtree, out_file=dot_data,feature_names=features,filled=True,rounded=True)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
Image(graph[0].create_png())
- Random forest is a common baseline model in ML
- we will use public available data from LendingClub regarding peer-to-peer credit
- our aim is to use the not.fully.paid column as target and predict if the loan was fully paid back or not
- the exercise asks to plot a histogram of the FICO score split by a categorical column (as a hue) using pandas. doing a hue with pandas viz is done like this
plt.figure(figsize=(10,6))
loans[loans['credit.policy']==1]['fico'].hist(bins=35,color='red',label='Credit Policy 1',alpha=0.6)
loans[loans['credit.policy']==0]['fico'].hist(bins=35,color='blue',label='Credit Policy 0')
plt.legend()
plt.xlabel('FICO')
- chapter9 of ISL
- Support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis
- Given a set of training examples, each marked for belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other, making it a non-probabilistic binary linear classifier
- An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible.
- New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on
- to show the basic idea behind SVMs we imagine a labeled training data set where the scatterplot of 2 features, hued by the target, is clearly separated between target classes.
- we want to do binary classification of new points. we can easily draw a separating "hyperplane" between the classes, but there are many options for separating lines (hyperplanes) that separate the classes perfectly. how do we choose the best? usually we have one going along the middle of the distance and two parallel ones on the border (margin) of each class.
- The vector points that these margin lines touch are known as Support Vectors.
- We can expand this idea to non-linearly separable data through the "kernel trick"; the separating boundary might then be a circle, for example
- the kernel trick adds one more dimension to the plot so we can separate the classes in the 3rd dimension
- we will use support vector machines to predict whether a tumor is malignant or benign
- for our project we will apply the concept to the famous iris flower data set
- we will learn how to tune our models with GridSearch
- we import the libs
- we import a scikit learn builtin dataset (Breast cancer)
from sklearn.datasets import load_breast_cancer and load it into a dictionary-like object cancer = load_breast_cancer() - we see the features as dictionary keys
cancer.keys() - we get info on the origin of the dataset
print(cancer['DESCR']) - we build a dataframe from the dictionary
df_feat = pd.DataFrame(cancer['data'],columns=cancer['feature_names']) - we check the labels with
df_feat.head(2) and df_feat.info() - we have 30 cols of medical data. we lack domain knowledge so we skip visualization
- we build our target df
df_target = pd.DataFrame(cancer['target'],columns=['Cancer']) - we go to ML directly. we split the data
X_train, X_test, y_train, y_test = train_test_split(df_feat, np.ravel(df_target), test_size=0.30, random_state=101) - we import the estimator
from sklearn.svm import SVC - we instantiate the model
model = SVC() - we train the model using defaults, we do the prediction and evaluate the results. we get poor results and a Warning: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
- What we see is that the model classifies all data into a single class, so it needs parameter adjustment and probably normalization of the data prior to use
- We will search for the best parameters using a GridSearch; it helps us find the right parameters (C or gamma)
- Instead of manually testing and searching for the right params, we use a grid of parameters to try out all possible combinations and see what fits best.
- we import it
from sklearn.grid_search import GridSearchCV (in newer scikit-learn it lives in sklearn.model_selection) - GridSearchCV takes in a dictionary that describes the parameters that should be tried in a model to train the grid. the keys are the parameters and the values are a list of settings to be tested.
- the C param controls the cost of misclassification on the training data (a large C value gives low bias and high variance). it gives low bias because we penalize the cost of misclassification more with a larger C value.
- the gamma parameter is the free parameter of the Gaussian radial basis function (kernel='rbf'), usually the best kernel to use. a small gamma means a wide Gaussian (large kernel variance), giving a smoother, higher-bias model; a high gamma means a narrow Gaussian, giving a lower-bias but higher-variance model
- we set the param_grid passing the params to test and the range of values as a dictionary
param_grid = { 'C':[.1,1,10,100,1000], 'gamma':[1,0.1,0.01,0.001,0.0001] } - we feed the param grid to the gridsearch
grid = GridSearchCV(SVC(),param_grid,verbose=3) - the verbose number controls the amount of text output
- we pass the estimator, the param grid and config params
- the output is an estimator model instance we can use to train on our training data (see the sketch below).
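- a minimal sketch of that fitting step (this is what actually runs the cross-validated search over the grid):
grid.fit(X_train, y_train)  # fits an SVC for every C/gamma combination and keeps the best one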
- it finds a best combination and iterates on this to get the best score
- we can get the best params with
grid.best_params_ - we can get the best estimator with
grid.best_estimator_ - we can get the best score with
grid.best_score_ - the grid has the best estimator ready to use. so we use it to get the best predictions
grid_predictions = grid.predict(X_test) - we print the eval reports. the improvement is dramatic
- we will work with the famous iris flower dataset: 150 samples, 3 classes (50 samples per class)
- to display image in jupyter notebook:
from IPython.display import Image
url = 'http://upload.wikimedia.org/wikipedia/commons/5/56/Kosaciec_szczecinkowaty_Iris_setosa.jpg'
Image(url,width=300, height=300)
- the K Means Clustering algorithm will allow us to cluster unlabeled data in an unsupervised machine learning setting
- mathematical explanation in ch10 of ISL
- K Means Clustering is an unsupervised learning algorithm that will attempt to group similar clusters together in your data
- A typical clustering problem:
- Cluster similar documents
- Cluster Customers based on Features
- Market segmentation
- Identify similar physical groups
- The overall goal is to divide data into distinct groups such that observations within each group are similar
- The K Means Algorithm:
- Choose a number of Clusters "K"
- Randomly assign each point to a cluster
- Until the clusters stop changing, repeat the following: for each cluster, compute the cluster centroid by taking the mean vector of the points in the cluster => assign each data point to the cluster whose centroid is closest
- Choosing a K value: we have to decide how many clusters we expect in the data
- There is no easy answer for choosing the best "K" value
- One way is the elbow method
- Compute the sum of squared error (SSE) for some values of k (for example 2,4,6,8)
- The SSE is defined as the sum of the squared distance between each member of the cluster and its centroid
- If we plot k against the SSE, we will see that the error decreases as k gets larger. this is because when the number of clusters increases, they should be smaller
- the idea of the elbow is to choose the k at which the SSE decreases abruptly
- this produces an elbow effect in the graph (increasing k does not improve the SSE much any more)
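- a minimal sketch of the elbow method with scikit-learn's KMeans, where inertia_ is the SSE (X stands for whatever feature matrix we are clustering, e.g. the blob data created below):
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
sse = []
for k in range(1, 11):
    km = KMeans(n_clusters=k)
    km.fit(X)                 # X: the feature matrix to cluster
    sse.append(km.inertia_)   # inertia_ = sum of squared distances to the closest centroid
plt.plot(range(1, 11), sse, marker='o')
plt.xlabel('K')
plt.ylabel('SSE')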
- in our project we will get real world data and try to cluster unis into groups, we will try to distinguish private from public
- we import the usual libraries (data analyzis + data viz)
- in unsupervised learning we are not focused on results but on patterns in the data
- we import sklearn inbuilt data sets (blobs of data)
from sklearn.datasets import make_blobs - we create a dataset for our algo with 200 samples from blob generator passing some parameters to control the distribution of our samples.
data = make_blobs(n_samples=200, n_features=2, centers=4, cluster_std=1.8,random_state=101) - with data in hand we will explore it. data is a tuple and data[0] is a numpy array with one row per sample and 2 columns of features
data[0].shape => (200,2): 200 samples with 2 features per sample - we have defined 4 centers in our data so we expect to see 4 blobs
- we plot a scatterplot of the 2 feat columns
plt.scatter(data[0][:,0],data[0][:,1],c=data[1],cmap='rainbow') with the color set to each sample's cluster group as output by make_blobs - we import our estimator
from sklearn.cluster import KMeans - we create an instance of our model. we need to define the number of clusters we want (we know it)
kmeans = KMeans(n_clusters=4) - we train the algorithm passing a numpy array (unlike supervised algs)
kmeans.fit(data[0]) - we get the centers (x,y) of the clusters in the 2-feature space, as a numpy array of nested arrays
- we get the labels the algorithm believes our samples should have, i.e. which cluster each sample belongs to
kmeans.labels_ which is a numpy array - if we work with real data and don't know the labels and the actual clustering beforehand, our work is done here. but now we know the actual clustering from the blob generator output
- we will plot the predicted clustering to the actual clustering, we plot 2 plots in one sharing y axis
fig, (ax1,ax2) = plt.subplots(1,2,sharey=True,figsize=(10,6))
ax1.set_title('K Means')
ax1.scatter(data[0][:,0],data[0][:,1],c=kmeans.labels_,cmap='rainbow')
ax2.set_title('Original')
ax2.scatter(data[0][:,0],data[0][:,1],c=data[1],cmap='rainbow')
- the plots look alike so kmeans did a good job. if we try kmeans with 2 or 3 clusters the output is very reasonable too
- EDA means exploratory data analysis
- to set a specific value in a dataframe
df.set_value('95','Grad.Rate',100). with a string key this creates a new empty row with an integer index, so we try df.iloc[95, df.columns.get_loc('Grad.Rate')] = 100 instead
- math in ch10.2 of ISL
- PCA is an unsupervised statistical technique used to examine the interrelations among a set of variables in order to identify the underlying structure of those variables
- It is also known sometimes as a general factor analysis
- Where regression determines a line of best fit to a data set, factor analysis determines several orthogonal lines of best fit on the data set
- Orthogonal means "at right angles"
- These lines are perpendicular to each other in n-dimensional space
- n-Dimensional Space is the variable sample space
- there are as many dimensions as there are variables, so in a data set with 4 variables the sample space is 4-dimensional
- To understand the intuition behind it we plot some data along 2 features in a scatterplot.
- we plot the regression line of best fit
- we add an orthogonal line to the regression line.
- now we can start to understand the components of our dataset
- Components are a linear transformation that chooses a variable system for the data set such that the greatest variance of the dataset comes to lie on the first axis, the second greatest variance on the second axis and so on...
- This process allows us to reduce the number of variables used in an analysis
- in our plot the regression line "explains" 70% of the variation, the orthogonal line "explains" the 28% of the variation. 2% of the variation remains unexplained
- Components are uncorrelated, since in the sample space they are orthogonal to each other
- we can continue this analysis into higher dimensions, although it's impossible to plot beyond the 3rd dimension
- If we use this technique on a data set with a large number of variables, we can compress the amount of explained variation to just a few components
- The most challenging part of PCA is interpreting the components
- For our work with PCA in Python, we will use scikit learn
- We usually want to standardize our data by some scale for PCA, so we will see how to do it
- The algorithm is used for analysis of data and not a fully deployable model, we wont have a project for the topic
- PCA is an unsupervised learning algorithm used for dimensionality reduction
- PCA is just a transformation of our data and attempts to find out what features explain the most variance in our data
- we import the usual libs (data analysis and vis)
- we load sklearn builtin datasets breast cancer data
from sklearn.datasets import load_breast_cancer and cancer = load_breast_cancer(); it is a dictionary-like object holding data and labels as key-value pairs - we make a dataframe out of it
df = pd.DataFrame(cancer['data'],columns=cancer['feature_names']) - the dataset has 30 attributes
- if we used another classification algorithm we would do PCA first to see what features are important in determining if a tumor is malignant or benign
- we will use PCA to find the 2 principal components and then visualize the data in the 2-dimensional space
- we first scale our data
from sklearn.preprocessing import StandardScaler - we will scale our data so every feat has 1 unit variance
- we make a scaler instance
scaler = StandardScaler() - we fit it to our dataframe
scaler.fit(df) - we transform our df with scaler
scaled_data = scaler.transform(df) - we perform now the PCA. we import it
from sklearn.decomposition import PCA - we instantiate it passing the num of components we want to keep
pca = PCA(n_components=2) - we fit pca to our scaled data
pca.fit(scaled_data) - we transform the data to its first two principal components
x_pca=pca.transform(scaled_data) - our original scaled data shape is
scaled_data.shape=> (569,30) - our decomposed data
x_pca.shape=> (569,2) - we plot out these dimensions
plt.figure(figsize=(8,6))
plt.scatter(x_pca[:,0],x_pca[:,1])
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
- it is difficult to understand the usefulness of this plot, so we add a color depending on the target class
plt.scatter(x_pca[:,0],x_pca[:,1],c=cancer['target']) we see the clustering to its full extent. - based only on the first and second principal component we have a clear separation
- PCA works like a compression algorithm in machine learning
- the components we generate do not map 1-to-1 to specific features in our data
- components are combinations of the original features
- if we print the array of components with
pca.components_ each row represents a principal component and each column relates it back to the original features - we can visualize this relationship with a heatmap to see the dependence of a component on the features
df_comp = pd.DataFrame(pca.components_,columns=cancer['feature_names'])
plt.figure(figsize=(12,6))
sns.heatmap(df_comp,cmap='plasma',)
- The idea is to feed pca output to a classification algorithm reducing the complexity (logistic regression)
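- a minimal sketch of that idea, feeding the 2-component representation into a logistic regression (the split and the choice of classifier are illustrative):
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X_train, X_test, y_train, y_test = train_test_split(x_pca, cancer['target'], test_size=0.3, random_state=101)
logmodel = LogisticRegression()
logmodel.fit(X_train, y_train)
print(logmodel.score(X_test, y_test))  # accuracy using only the 2 principal components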
- reading textbook: Recommender Systems by Jannach and Zanker
- Fully developed and deployed recommendation systems are extremely complex and resource intensive
- there are 2 notebooks for this course. a) Recommender Systems w/ Python b) Advanced Recommender Systems w/ Python
- Full recommender systems require a heavy linear algebra background (WE HAVE IT!!) the Advanced Recommender System notebook is provided as an optional resource
- A simpler version of creating a recommendation system using item similarity is used for the course example
- The two most common types of recommender systems are Content Based and Collaborative Filtering (CF)
- Collaborative Filtering produces recommendations based on the knowledge of users attitude to items, that is it uses the "wisdom of the crowd" to recommend items
- Content-based recommender systems focus on the attributes of the items and give you recommendations based on the similarity between them
- In the real world, Collaborative Filtering (CF) is more commonly used than content-based systems because it usually gives better results and is relatively easy to understand (from an implementation perspective)
- The algorithm has the ability to do feature learning on its own, which means that it can start to learn for itself what features to use
- CF can be divided into two subcategories: Memory-Based Collaborative Filtering and Model-Based Collaborative Filtering
- In the advanced notebook, we implement Model-Based CF using a singular value decomposition (SVD) and Memory-Based CF by computing cosine similarity
- For the simple Python implementation, we will create a content based recommender system for a data set of movies
- This movie data set is usually a student's first data set when beginning to learn about recommender systems
- It is quite large compared to some of the data sets we've used so far. in general, recommender systems in real life deal with much larger data sets
- we import the usual data analytics and vis libraries
- we create a list of the column names
columns_names=['user_id','item_id','rating','timestamp'] - we load the data from csv
df = pd.read_csv('u.data',sep='\t',names=columns_names) which is tab separated. we also pass the column names as the csv is just raw data.
- we grab movie_titles from another csv
movie_titles=pd.read_csv('Movie_Id_Titles'). this dataset maps item ids to movie titles. we will replace the item ids in the master dataset with the actual titles by doing a merge df=pd.merge(df,movie_titles, on='item_id'). the merged dataframe contains ids and titles now - we will show the best rated movies. to do this we will create a ratings dataframe
df.groupby('title')['rating'].mean() calculating the average rating for each title. we can now sort them in descending order df.groupby('title')['rating'].mean().sort_values(ascending=False).head() - the info might be misleading as the average score might be due to only one review. so we also count the votes
df.groupby('title')['rating'].count().sort_values(ascending=False).head() - we make a new dataframe called ratings with the titles and the ratings
ratings = pd.DataFrame(df.groupby('title')['rating'].mean()). - we add a num of ratings column next to the ratings
ratings['num of ratings'] = pd.DataFrame(df.groupby('title')['rating'].count()) - we explore the new ratings dataframe with some histograms.
ratings['num of ratings'].hist(bins=70) most titles have very few ratings - we do a histogram of the average ratings per title. there are peaks at whole numbers (as titles with few reviews get whole-number averages)
ratings['rating'].hist(bins=70) - we use jointplot to see distribution of ratings vs nums
sns.jointplot(x='rating',y='num of ratings', data=ratings, alpha=0.5)
- we will create a matrix that has user ids on one axis and movie titles on the other axis. each cell will hold the rating the user gave to the movie. we will use pivot_table to bring df into matrix form
moviemat = df.pivot_table(index='user_id',columns='title',values='rating'). this matrix has a lot of null values - we check the ten most-rated movies
ratings.sort_values('num of ratings',ascending=False).head(10)
starwars_user_ratings = moviemat['Star Wars (1977)']
liarliar_user_ratings = moviemat['Liar Liar (1997)']
- the two dataframes have user_id rows with the ratings they gave
- we will use the corrwith() method to compute the pairwise correlation between rows or columns of two dataframes
similar_to_starwars = moviemat.corrwith(starwars_user_ratings) we get all the movies and their correlation with the Star Wars movie ratings
- we do the same for liarliar
similar_to_liarliar = moviemat.corrwith(liarliar_user_ratings) - we transform them into dataframes, dropping the nulls
corr_starwars = pd.DataFrame(similar_to_starwars, columns=['Correlation'])
corr_starwars.dropna(inplace=True)
- what we end up with is a dataframe with all the movies and a value that shows the correlation of their user ratings to the user ratings of Star Wars. what we want to get are the most correlated, i.e. the most similar movies
corr_starwars.sort_values('Correlation',ascending=False).head(10). we get movies with perfect correlation that don't make much sense; they are probably movies with just 1 review from a person who gave Star Wars 5 stars. so we want to filter out movies with fewer than a threshold of reviews, like 100. - we join the table with the num of ratings column
corr_starwars = corr_starwars.join(ratings['num of ratings']) we filter out movies with <100 reviews and sort again by correlation corr_starwars[corr_starwars['num of ratings']> 100].sort_values('Correlation',ascending=False).head(10) - the results now do make sense: we get perfect correlation with the same movie, good correlation with similar movies, and then the correlation drops significantly
- now we do the same drill for liarliar
corr_liarliar = pd.DataFrame(similar_to_liarliar,columns=['Correlation'])
corr_liarliar.dropna(inplace=True)
corr_liarliar = corr_liarliar.join(ratings['num of ratings'])
corr_liarliar[corr_liarliar['num of ratings']>100].sort_values('Correlation',ascending=False).head(10)
- we can play with the threshold to improve recommendation relevance
- Read Wikipedia Article for theory
- NLP serves a lot of purposes when working with structured or unstructured text data
- use cases: group news articles by topic, sift through 1000s of legal documents to find relevant ones
- With NLP we would like to: Compile Documents, Featurize Them, Compare their features
- Say we have 2 docs:
- "Blue House"
- "Red House"
- A way to featurize docs is based on word count (transform them as vectorized word count)
- "Blue house" => (red,blue,house) => (0,1,1)
- "Red house" => (red,blue,house) => (1,0,1)
- A document represented as a vector of word counts is called a "bag of words"
- we can use cosine similarity on these vectors to determine similarity: sim(A,B) = cos(θ) = (A·B) / (‖A‖ ‖B‖)
- we can improve on Bag of Words by adjusting word counts based on their frequency in the corpus (the group of all documents)
- we can use TF-IDF (Term Frequency-Inverse Document Frequency) tod o this
- This technique is used in ElasticSearch
- Term Frequency: Importance of a term within that document
- TF(d,t) = Number of occurrences of term t in document d
- Inverse Document Frequency: Importance of the term in the corpus
- IDF(t)=log(D/t) where D=total num of documents, t=number of documents with the term
- Mathematically we can express TF-IDF as: W_x,y = TF_x,y * log(N / df_x)
- W_x,y = the TF-IDF weight of term x within document y
- TF_x,y = frequency of term x in document y
- df_x = number of documents containing x
- N = total number of documents
- TF-IDF contains the idea of importance of a word
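- a rough plain-Python illustration of these formulas on the two example documents (not the sklearn implementation we use later):
import math

docs = [['blue', 'house'], ['red', 'house']]
N = len(docs)

def tf(term, doc):
    # number of occurrences of the term in the document
    return doc.count(term)

def idf(term):
    # log of total documents over documents containing the term
    df = sum(1 for doc in docs if term in doc)
    return math.log(N / df)

def tfidf(term, doc):
    return tf(term, doc) * idf(term)

print(tfidf('house', docs[0]))  # 0.0   - 'house' appears in every doc, so it carries no weight
print(tfidf('blue', docs[0]))   # ~0.69 - 'blue' is specific to the first document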
- before we get started with NLP and Python we'll need to download an additional library,
conda install nltk or pip install nltk we will do it in the plotly env - our example notebook will be a spam filter, our real project will be working with real data from Yelp, a review site
- we
import nltk - we run
nltk.download_shell() which runs an interactive shell in jupyter to select options - we type l and see the list of packages
- we type d and then stopwords to download the stopwords library
- we will use a UCI dataset called the SMS Spam Collection Dataset. we have it in the smsspamcollection folder
- we use list comprehension to load the data in a list
messages = [line.rstrip() for line in open('smsspamcollection/SMSSpamCollection')] and print its length print(len(messages)) it has 5574 messages - we will iterate through the first 10 messages to see their content
for mess_no, message in enumerate(messages[:10]):
    print(mess_no, message)
    print('\n')
- messages are labeled as spam or ham, with the label followed by a tab. we can check it with
messages[0] - we will use pandas to easily parse the message in dataset
messages = pd.read_csv('smsspamcollection/SMSSpamCollection',sep='\t', names=['label','message']) using the tab separator and adding column labels - We will now do EDA on our data. we start with
messages.describe() which reveals that we have quite a few duplicate messages - we will group by label and then describe to see how the features differentiate between the two classes
messages.groupby('label').describe() - the bulk of our work with NLP is feature engineering. to extract features we need domain knowledge. the better we know the dataset the more insight we can get
- we check the length of the text messages and add it as a new column
messages['length'] = messages['message'].apply(len) - we will visualize the lengths of the messages in a histogram
messages['length'].plot(bins=50, kind='hist') as we increase the bins we get bimodal behaviour - we get some insight into length with
messages.length.describe() the max length is 910 - it's quite weird so we opt to look into it
messages[messages['length'] == 910]['message'].iloc[0] this is clearly an outlier - we now want to plot the histogram of length but separated per label
messages.hist(column='length',by='label',bins=50,figsize=(12,4))the kdes are clearly different
- our effort now will focus on text preprocessing (turning the text into vectors)
- we
import string library - our first task is to remove punctuation
- we create a test string
mess = 'Sample message! Notice: it has punctuation.' string.punctuation has a bunch of punctuation symbols to be used for filtering - we will use a list comprehension on the string to clear it from punctuation
nopunc = [char for char in mess if char not in string.punctuation]. this is a punctuation-free list of chars from our string. we reassemble it with nopunc = ''.join(nopunc) - our task is now to remove stopwords
- we import stopwords lib
from nltk.corpus import stopwords - we test it
stopwords.words('english') prints out all english stopwords - we split our string to a list of words
nopunc.split() and again use a list comprehension to clear it from stopwords clean_mess = [word for word in nopunc.split() if word.lower() not in stopwords.words('english')] - we bundle it all up in a function to apply to our dataset
def text_process(mess):
    # Check characters to see if they are in punctuation
    nopunc = [char for char in mess if char not in string.punctuation]
    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)
    # Now just remove any stopwords
    return [word for word in nopunc.split() if word.lower() not in stopwords.words('english')]
- we apply it to the first 5 rows for testing
messages['message'].head(5).apply(text_process) and it works - we could further normalize our dataset (e.g. remove single-letter words)
- stemming is a very common way to continue processing our text data (it treats words with similar meaning as one). stemming needs a reference dictionary to do it. nltk has that
- we now proceed with vectorization of our data
- we will create our bag of words in 3 steps
- Count how many times does a word occur in each message (Known as term frequency)
- Weigh the counts, so that frequent tokens get lower weight (inverse document frequency)
- Normalize the vectors to unit length, to abstract from the original text length (L2 norm)
- our result table will be a matrix of messages vs words(unique) with a counter of word appearance in a message. the columns will be the word vectors or bags of words
- we expect to have a lot of 0s in our matrix. scikit learn will output a Sparse matrix
- we import CountVectorizer
from sklearn.feature_extraction.text import CountVectorizer which accepts a lot of params - we instantiate it passing our custom function as the analyzer. we directly chain the fit method to fit it on our data
bow_transformer = CountVectorizer(analyzer=text_process).fit(messages['message']) - we want to see how many words our vocabulary contains
print(len(bow_transformer.vocabulary_)) it is more than 11000
- we test it on a sample message.
message4 = messages['message'][3]
bow4 = bow_transformer.transform([message4])
bow4.shape
- the vector has the indexes of the words and the number of occurrences in the message.
- its shape is a vector spanning the full vocabulary
- to get the actual word in the vocabulary by its id we use
print(bow_transformer.get_feature_names()[4073])=> 'U'
- we already saw our bag of words transformer in action on one sample message
- we can use the transformer's .transform() method and transform the whole dataset
messages_bow = bow_transformer.transform(messages['message']) our dataset now has the bow for each message messages_bow.shape => (5572,11425) - we expect it to be a sparse matrix, so we count the non-zero occurrences
messages_bow.nnz => 50548 cells, roughly a 1/1000 ratio - we can calculate the sparsity of the matrix
sparsity = (100.0 * messages_bow.nnz / (messages_bow.shape[0] * messages_bow.shape[1])) => 0.079% so practically 0 - with our bow ready we can calculate TF-IDF using sklearn's tf-idf transformer
- we import it
from sklearn.feature_extraction.text import TfidfTransformer - we instantiate a transformer from it
tfidf_transformer = TfidfTransformer().fit(messages_bow) fitting it on our bag of words - to test it we apply the tfidf transform on message4's bow that we have ready from before.
tfidf4 = tfidf_transformer.transform(bow4) the output is a similar list but the values are not counters like in bow, they are relevance scores (like elasticsearch) or weight values - we can use tfidf to check the inverse document frequency of a given word
tfidf_transformer.idf_[bow_transformer.vocabulary_['university']]=> 8.527076498901426 - we can create the full dataset tfidf with
messages_tfidf = tfidf_transformer.transform(messages_bow) - so what we get is a matrix of tfidf scores.
- what we are doing here is building an NLP pipeline
- with the tfidf scores ready for the whole dataset we can now train the spam/ham classifier
- we can use any classification algorithm to do it; we choose the naive bayes classifier (also recommended in the cheat sheet)
- we import it
from sklearn.naive_bayes import MultinomialNB - we instantiate it creating a model and fitting it on our training data
spam_detect_model = MultinomialNB().fit(messages_tfidf,messages['label']) - we evaluate our model on message4
spam_detect_model.predict(tfidf4)[0] => 'ham', we compare it to the actual label messages['label'][3] and it is the same - if we want to evaluate all our messages
all_pred = spam_detect_model.predict(messages_tfidf) - we have trained on all our training data, so evaluating on the same training data does not make sense
- we will split our data to do proper evaluation
from sklearn.model_selection import train_test_split
msg_train, msg_test, label_train, label_test = \
    train_test_split(messages['message'], messages['label'], test_size=0.3)
- so we now have to go through the complete pipeline to come up with our tf_idf to feed to naive bayes for our train set
- it is such a common process that sklearn has a ready data pipeline. we import it
from sklearn.pipeline import Pipeline - we instantiate it passing the steps
pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=text_process)),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])
- our pipeline is ready to use on our train data.
pipeline.fit(msg_train,label_train) - we evaluate it
predictions = pipeline.predict(msg_test) - we do our classification report and evaluate on label_test (see the sketch below). the results are exceptionally good: 96%
- the pipeline is a good way to evaluate algorithms on the fly
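- a minimal evaluation sketch, assuming the fitted pipeline and the msg_test/label_test split from above:
from sklearn.metrics import classification_report, confusion_matrix

predictions = pipeline.predict(msg_test)
# precision/recall/f1 per class plus the raw confusion matrix
print(classification_report(label_test, predictions))
print(confusion_matrix(label_test, predictions))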
- we will use NLP to categorize reviews into * (onestar) or ***** (fivestar) category based on their content
- we will use the pipeline to simplify the flow
- fit_transform() method fits and transforms in the same time
- In this Lecture we will see:
- Explanation of Hadoop, MapReduce, Spark and PySpark
- Local vs Distributed Systems
- Overview of Hadoop Ecosystem
- Detailed overview of Spark
- Set-up Amazon Web-Services
- Resources on other Spark Options
- Jupyter Notebook hands-on code with PySpark and RDDs (resilient distributed datasets)
- So far we've worked with datasets that can fit on a local computer (RAM in scale of 0-8GB)
- What can we do if we have larger datasets?
- Try using an SQL DB to move storage on hard drive instead of RAM
- Use a distributed system, that distributes data onto multiple machines/computers (Correct Approach)
- With a distributed system we can harness the power of multiple cores in a scalable way, controlling it from the master node
- A local process will use the computational resources of a single machine / A distributed process has access to the computational resources across a number of machines connected through the network
- Scaling is easier in distributed systems (add more nodes). Also we get fault tolerance
- We will discuss the typical format of a distributed system that uses Hadoop (Apache Hadoop)
- Hadoop is a way to distribute very large files across multiple machines
- It uses the Hadoop Distributed File System (HDFS)
- HDFS allows a user to work with large data sets
- HDFS duplicates blocks of data for fault tolerance
- It uses MapReduce which allows computations on that data
- An HDFS (distributed storage) includes a Name Node (master) and a bunch of Data Nodes (slaves) , each node has its CPU and RAM
- HDFS will use blocks of data, with a default size of 128MB
- Each of these blocks is replicated 3 times and distributed to support fault tolerance
- splitting data into smaller blocks provides parallelization during processing
- multiple copies of a block prevent loss of data due to node failure
- MapReduce is a way of splitting a computation task across a distributed set of files (HDFS)
- It consists of a Job Tracker (Master Node) and multiple Task Trackers (Slave nodes)
- Job tracker sends code to run on the task trackers
- task trackers allocate CPU and memory for the tasks and monitor the tasks on the worker nodes
- The approach on handling BigData we have seen contains 2 parts:
- Use HDFS to distribute large datasets
- Use MapReduce to distribute computational tasks to a distributed data set
- Spark builds on top of these technologies providing improvements
- This lecture gives an abstract overview on:
- Spark, Spark vs MapReduce, Spark RDDs, RDD Operations
- (Apache) Spark is one of the latest technologies used to quickly and easily handle Big Data
- Open Source. Introduced in 2013 by UC Berkeley AMPLab. Very popular due to its ease of use and speed
- Spark is a flexible alternative to MapReduce.
- Spark can use data stored in a variety of formats: Cassandra, AWS S3, HDFS, ... (Elastic?)
- Spark vs MapReduce:
- MapReduce works only with HDFS stored files. Spark does not
- Spark performs operations up to 100x faster than MapReduce. How??
- MapReduce writes most data to disk after each map and reduce operation. Spark keeps most of the data in-memory after each transformation
- Spark can spill over to disk if memory is filled
- Spark works on the idea of RDD (Resilient Distributed Dataset)
- RDDs have 4 main features:
- Distributed Collection of Data
- Fault Tolerance
- Parallel operation - partitions
- Ability to use many data sources
- Spark works in a distributed way: A Driver Program running the SparkContext <-Communicates-> Cluster Manager <-Communicates-> Worker Nodes (execute the tasks)
- In an example we have: RDD Objects: build the graph (Build Operator of DAG=DirectedAcyclicGraph) -DAG-> DAGScheduler: split graph into stages of tasks (submit each stage as ready) -TaskSet-> TaskScheduler@ClusterManager: launch tasks via cluster manager (retry failed or straggling tasks) -Task-> Worker: execute tasks (store and serve blocks)
- Spark allows programmers to develop complex multistep data pipelines using DAG patterns
- Also it supports in-memory data-sharing across DAGs, allowing different jobs to work on the same data.
- Spark in python is concerned with building RDD objects; the rest is done transparently
- RDDs are immutable, lazily evaluated and cachable
- There are two types of RDD operators:
- Transformations
- Actions
- That's what we code with python on Spark
- The Basic actions we are going to look at are:
- First: Return the first element in the RDD
- Collect: Return all the elements of the RDD as an array at the driver program
- Count: Return the number of elements in the RDD
- Take: Return an array with the first n elements of the RDD
- These 4 actions look like methods when we program with Python on the RDD object we create
- The Basic Transformations we are going to look at are:
- filter(): applies a function to each element and returns elements that evaluate to true
- map(): transforms each element and preserves # of elements like pandas .apply()
- flatMap(): transforms each element into 0-N elements and changes # of elements
- map() e.g. grabbing the first letter of each name in a list of names
- flatMap() e.g. transforming a corpus of text into a flat list of words (see the sketch below)
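- a quick PySpark sketch contrasting the two, assuming an active SparkContext sc (the names here are made up for illustration):
names_rdd = sc.parallelize(['John Smith', 'Jane Doe', 'Mary Johnson'])
# map preserves the number of elements: one first letter per name
names_rdd.map(lambda name: name[0]).collect()            # ['J', 'J', 'M']
# flatMap flattens the per-element results: every word from every name
names_rdd.flatMap(lambda name: name.split()).collect()   # ['John', 'Smith', 'Jane', 'Doe', 'Mary', 'Johnson']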
- Pair RDDs: often RDDs will hold their values in tuples (key,value). This offers better partitioning of data and leads to functionality based on reduction
- we have two additional methods to use which differ between RDDs and Pair RDDs:
- reduce(): An action that will aggregate RDD elements using a function, returning a single element
- reduceByKey(): Aggregates Pair RDD elements by key using a function, returning a Pair RDD
- these ideas are similar to dataframe groupby() operation
- Spark is under rapid development. Latest additions to the ecosystem: Spark SQL, Spark DataFrames, MLlib, GraphX, Spark Streaming
- we will set up an AWS account to use Spark in the cloud (also local installation is possible)
- Amazon Guide
- we create a free account
- we get free EC2 S3 RDS. we will use EC2 to deploy a VM for a micro instance
- in our training EC2 instance we keep all ports open. in production we should use strict settings on ports
- EC2 stands for Amazon Elastic Compute Cloud and is a web service that provides resizable compute capacity in the cloud
- we create our EC2 instance in AWS and use SSH to connect to it from our local machine
- we setup Spark and Jupyter on the EC2 instance
- login to aws -> EC2 -> Ubuntu 64bit free tier -> t2.micro -> Conf. Instance Details -> Num of Instances: 1 (higher num for larger datasets) all default -> Add storage: all default -> add tags: Key: myspark, val: mymachine -> config security group type: all traffic -> review & launch -> launch -> create new keypair key: newspark & download and store the .pem file -> launch instance
- we see it in console. when we are done we use actions -> instance state -> terminate to remove it and avoid charges
- get the DNS address of our EC2 instance
- we open a terminal and go to the directory where we have the .pem file. we
chmod 400 newspark.pem - we run
ssh -i newspark.pem ubuntu@<publicdns>and connect to our aws ec2 vm
- instructions list
- we install anaconda on our remote EC2 instance
wget http://repo.continuum.io/archive/Anaconda3-4.1.1-Linux-x86_64.sh
bash Anaconda3-4.1.1-Linux-x86_64.sh
- follow the instructions (all defaults). anaconda installs its own python
- we
source .bashrc to add the python location to the path and confirm it with which python - now we need to configure jupyter notebook to work over ssh
jupyter notebook --generate-config. it creates a configuration file for us - we need to create certificates for our connections (.pem files)
mkdir certs
cd certs
sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem
- we answer the questions and we get the mycert.pem cert file in our dir
- we now edit the config file. we go to
~/.jupyter/ - we will use vi to edit the
vi jupyter_notebook_config.py and add
c = get_config()
# Notebook config this is where you saved your pem cert
c.NotebookApp.certfile = u'/home/ubuntu/certs/mycert.pem'
# Run on all IP addresses of your instance
c.NotebookApp.ip = '*'
# Don't open browser by default
c.NotebookApp.open_browser = False
# Fix port to 8888
c.NotebookApp.port = 8888
- we run the notebook with
jupyter notebook it should run on port 8888. we access it from a browser using the public dns and the port - proxies and firewalls may make it refuse the connection though
- spark is written in scala so we need to install scala and scala uses java
sudo apt-get update
sudo apt-get install default-jre
java -version
- with java installed we install scala
sudo apt-get install scala
scala -version
- we need py4j library to allow python to connect to java. to do this we need to make sure our pip install is connected to our anaconda installation path and not the default
export PATH=$PATH:$HOME/anaconda3/bin
- then we use conda to install pip
conda install pip and test to see our install path with which pip to verify it's in anaconda - we install py4j with pip
pip install py4j - we will now install spark and hadoop
wget http://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
sudo tar -zxvf spark-2.0.0-bin-hadoop2.7.tgz
- we connect python to spark
export SPARK_HOME='/home/ubuntu/spark-2.0.0-bin-hadoop2.7'
export PATH=$SPARK_HOME:$PATH
export PYTHONPATH=$SPARK_HOME/python:$PYTHONPATH
- we relaunch jupyter notebook. open it in a browser and test our installation with importing pyspark
from pyspark import SparkContext
sc = SparkContext()
- we have already seen this in python bootcamp
- lambdas are anonymous functions
- a normal func and its condensed form
def square(num):
    result = num**2
    return result

def square(num): return num**2
- its lambda equivalent
lambda num: num**2 - transforming a lambda to a normal func
sq = lambda num: num**2 we can call it normally like a func sq(4) - lambdas are typically passed into functions that operate on arrays or datasets
rev = lambda s: s[::-1] reversing strings or arrays - we can use multiple vars in a lambda adder = lambda x,y: x+y (see the sketch below)
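- a small sketch of lambdas in use (plain Python, the same style of functions we will pass to Spark's RDD methods later):
nums = [1, 2, 3, 4, 5]
# map applies the lambda to every element
list(map(lambda num: num**2, nums))           # [1, 4, 9, 16, 25]
# filter keeps elements for which the lambda returns True
list(filter(lambda num: num % 2 == 0, nums))  # [2, 4]
# multiple arguments work too
adder = lambda x, y: x + y
adder(3, 4)                                   # 7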
- we scp the PySpark notebooks to the remote server
- we import pyspark
from pyspark import SparkContext - SparkContext represents the connection to the spark cluster. it can be used to create an RDD and broadcast variables on the cluster
- we can have only one context running
- to create it we need to make an instance
sc = SparkContext() - we will try to read a text file. first we have to create it. we will use jupyter notebook commands to write to the file. whatever we put in the cell gets written to the file
%%writefile example.txt
first line
second line
third line
fourth line
- with the file created we can use the SparkContext() textFile method to create an RDD of strings out of a textfile in an FS.
textFile = sc.textFile('example.txt')
- textFile is an RDD object and we can perform operations on it. sc is the sparkcontext that connects to the spark cluster
- we can perform actions and transformations on an RDD object
- we run an action on the RDD
textFile.count() and textFile.first() - to do more complicated tasks we can do transformations on the RDD to return a new RDD. we apply the filter() method
secFind = textFile.filter(lambda line: 'second' in line) - secFind is a new RDD. the transformation is fast as RDDs are lazily evaluated (transformation does not really execute until we perform an action on the new RDD)
- to use spark with python: we create a SparkContext -> create an RDD -> perform actions or transformations on it
-
Important Terms:
- RDD: Resilient Distributed Dataset
- Transformation: Spark operation that produces an RDD
- Action: Spark operation that produces a local object
- Spark Job: Sequence of transformations on data with a final action
-
There are two common ways to create an RDD:
sc.parallelize(array): Create RDD of elements of array or list
sc.textFile(path to file): Create RDD of lines from file
-
RDD Transformations:
filter(lambda x: x % 2 == 0): Discard non-even elements
map(lambda x: x*2): Multiply each RDD element by 2
map(lambda x: x.split()): Split each string into words
flatMap(lambda x: x.split()): Split each string into words and flatten the sequence
sample(withReplacement=True, fraction=0.25): Create a sample of 25% of elements with replacement
union(rdd): Append rdd to existing RDD
distinct(): Remove duplicates in RDD
sortBy(lambda x: x, ascending=False): Sort elements in descending order
-
RDD Actions:
collect(): Convert RDD to in-memory list
take(3): First 3 elements of the RDD
top(3): Top 3 elements of the RDD
takeSample(withReplacement=True, num=3): Create a sample of 3 elements with replacement
sum(): Find element sum (assumes numeric elements)
mean(): Find element mean (assumes numeric elements)
stdev(): Find element standard deviation (assumes numeric elements)
-
we create a textfile in jupyter passing in 4 lines of text
%%writefile example2.txt
first
second line
the third line
then a fourth line
- we create an sc (SparkContext) instance and use it to make an RDD out of the file (text_rdd)
- we perform a transformation on the RDD (word split) creating a new one
words = text_rdd.map(lambda line: line.split()) - we convert the new RDD to an in-memory list
words.collect() - it is a list of nested lists of strings.
- if we perform collect() on the original RDD we get a list of strings
- if instead of map we perform the flatMap transformation on the original RDD and then convert it to an in-memory list
text_rdd.flatMap(lambda line: line.split()).collect() instead of nested lists we get a single list with all the words (a flat list) - we'll look into key-value pair RDDs
- we write tabular data in atext file
%%writefile services.txt
#EventId Timestamp Customer State ServiceID Amount
201 10/13/2017 100 NY 131 100.00
204 10/18/2017 700 TX 129 450.00
202 10/15/2017 203 CA 121 200.00
206 10/19/2017 202 CA 131 500.00
203 10/17/2017 101 NY 173 750.00
205 10/19/2017 202 TX 121 200.00
- we use sc to make an RDD out of it and perform an action
.take(2) selecting the first 2 elements (lines), as we have single strings per line - we split the lines
services.map(lambda line: line.split()).take(3) our lines are now lists of strings (words) - to remove the hash from the first label we do
services.map(lambda line: line[1:] if line[0]=='#' else line).collect() so we remove the hash if it is the first char in the line string, else we leave the line as is. to apply the transformation we chain .collect() to make it an in-memory list (for the next steps we keep the hash-removed and split RDD as clean)
- this format refers back to pair RDDs we talked about before
- to get total sales per state (like groupby in pandas): in spark we don't have groupby, we must use reduceByKey. first we extract the 2 columns of interest using their index in the list of strings. we make it a tuple
pairs = clean.map(lambda lst: (lst[3],lst[-1])) - we use
rekey = pairs.reduceByKey(lambda amt1,amt2: amt1 + amt2) which works like a groupBy. as the amounts are strings it just concatenates strings instead of adding numbers - we fix that by converting them to floats
rekey = pairs.reduceByKey(lambda amt1,amt2: float(amt1) + float(amt2))
filter(lambda x: not x[0]=='State') - we sort the results
.sortBy(lambda stateAmount: stateAmount[1], ascending=False) sort by the second element in the tuples in descending order (the full chain is sketched below)
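- putting the whole drill together in one chain (a sketch, assuming the services RDD loaded from services.txt above):
total_sales = services.map(lambda line: line[1:] if line[0] == '#' else line) \
    .map(lambda line: line.split()) \
    .map(lambda lst: (lst[3], lst[-1])) \
    .filter(lambda x: not x[0] == 'State') \
    .reduceByKey(lambda amt1, amt2: float(amt1) + float(amt2)) \
    .sortBy(lambda state_amount: state_amount[1], ascending=False)
total_sales.collect()  # [('NY', 850.0), ('CA', 700.0), ('TX', 650.0)] for the sample data above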
- also many times we don't use a lambda but apply our own functions
x = ['ID','State','Amount']
def func1(lst):
    return lst[-1]

def func2(id_st_amt):
    # Unpack Values
    (Id, st, amt) = id_st_amt
    return amt
func1(x)
>> 'Amount'
func2(x)
>> 'Amount'
- Neural Networks aka Reinforced Learning
- Neural Networks are modeled after biological neural networks and attempt to allow computers to learn in a similar manner to humans - reinforced learning
- Sample use cases are:
- Pattern Recognition
- Time series Predictions
- Signal Processing
- Anomaly Detection
- Control
- The human brain has interconnected neurons with dendrites that receive inputs; based on those inputs they produce an electrical output signal through the axon to the next neuron
- Why ANN?? There are problems difficult for humans but easy for computers (calculating complex arithmetic problems). There are problems easy for humans but difficult for computers (recognize a picture of a person from the side)
- Neural networks attempt to solve problems that would normally be easy for humans but hard for computers
- We will have a look at the simplest Neural Network possible: the Perceptron
- A perceptron consists of one or more inputs, a processor, and a single output
- It follows the "feed-forward" model: inputs are sent to the neuron, are processed and result in an output
- A perceptron process follows 4 main steps: 1) Receive Inputs 2) Weight Inputs 3) Sum Inputs 4) Generate Output
- usually the weighted value in a neuron is a number between -1 and 1
- weights are per input
- the output of a perceptron is generated by passing that sum through an activation function. in the case of a simple binary output, the activation function is what tells the perceptron to fire or not
- There are many activation functions (logistic, trigonometric, step etc). for our example we will make the activation function the sign of the sum: if the sum is positive the output is 1, if it is negative the output is -1
- We should always have bias in mind. if both inputs were zero, the sum would be zero no matter what weights we apply. to avoid this problem, a 3rd input called bias is added with a value of 1
- To train the perceptron we use the following steps:
- 1 provide the perceptron with inputs for which there is a known answer
- 2 ask the perceptron to guess an answer
- 3 compute the error (how far off it is from the correct answer)
- 4 adjust all the weights according to the error
- 5 return to step 1 and repeat until we reach an error we are OK with (set beforehand) - a minimal sketch of this loop follows below
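- a minimal numpy sketch of this training loop for a single perceptron on a toy problem (the data, learning rate and epoch count are made up for illustration):
import numpy as np

# toy data: learn the logical OR of two binary inputs, labels in {-1, 1}
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, 1, 1, 1])

rng = np.random.default_rng(0)
weights = rng.uniform(-1, 1, size=2)  # one weight per input, starting random in [-1, 1]
bias = rng.uniform(-1, 1)             # the extra bias input (fixed input of 1)
learning_rate = 0.1

for epoch in range(20):
    for inputs, target in zip(X, y):
        # steps 1-3: weight the inputs, sum them, pass through the sign activation
        guess = 1 if inputs.dot(weights) + bias > 0 else -1
        error = target - guess
        # step 4: adjust all the weights (and the bias) according to the error
        weights += learning_rate * error * inputs
        bias += learning_rate * error

print(weights, bias)  # after training the perceptron separates the 4 points correctly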
- To create a neural network all we have to do is link many perceptrons together in layers
- In neural networks we have one input and one output layer. any layers in between are known as hidden layers (we don't see anything but the i/o)
- Deep Learning refers to a Neural Network with many hidden layers, making it deep (e.g state of the art vision recognition uses 152 layers)
- We will see what TensorFlow lib is and how we can use it
Lecture 133 - What is Tensorflow
- TensorFlow is an open-source software library developed by Google
- it has quickly become the most popular Deep Learning library in the field
- it can run on either CPU or GPU. typical DNNs run much faster on GPU
- the basic idea of TensorFlow is to be able to create dataflow graphs
- these graphs have nodes (hidden layers) and edges. as we saw in the lecture about NNs
- the arrays(data) passed along from layer of nodes to layer of nodes is known as a Tensor
- There are 2 ways to use TensorFlow
- Customizable Graph Session
- SciKit-Learn type interface with Contrib.Learn
- We will go over both methods. the Session method can feel very overwhelming without a strong background in mathematics
- the tensorflow site explains how to install tensorflow with python on an anaconda installation
- mac/linux
conda install -c conda-forge tensorflow - we choose the virtualenv installation with pip (recommended) being in our virtualenv (source activate plotly)
pip install -U tensorflow-gpu for gpu support (needs a lot of config on the machine); we will install the simple pip install -U tensorflow with only CPU support to finish the course
- we check installation with importing tensorflow
import tensorflow as tf - we create a simple constant with tensorflow
hello = tf.constant('Hello World') - we check the type
type(hello) => tensorflow.python.framework.ops.Tensor. it is stored as a tensor object - we create a num constant which is also stored as a tensor
- we create a session object
sess = tf.Session() a tensorflow session is the environment where Operation objects get executed and Tensor objects are evaluated in these operations - if we call the run() method in a session and pass a tensor it gets executed
sess.run(hello) we get a string and its type() is bytes. type(sess.run(x)) is numpy.int32 - we can run multiple operations on multiple constants in tensorflow sessions using a with-block syntax
x = tf.constant(2)
y = tf.constant(3)
with tf.Session() as sess:
    print('Operations with Constants')
    print('Addition',sess.run(x+y))
    print('Subtraction',sess.run(x-y))
    print('Multiplication',sess.run(x*y))
    print('Division',sess.run(x/y))
- many times we don't have constants readily available, so we need to define a Placeholder to accept values
- we define the vars as placeholders of a specific (tensorflow builtin) type
x = tf.placeholder(tf.int32) (and similarly y = tf.placeholder(tf.int32)) - there are many builtin types available (gpu enabled tensorflow has even more)
- we can use tf built-in operations on placeholders
add = tf.add(x,y) or div, mul, sub - we can use sessions to wrap multiple builtin operations running on multiple placeholders which we pass as dictionaries
with tf.Session() as sess:
    print('Operations with Placeholders')
    print('Addition',sess.run(add,feed_dict={x:20,y:30}))
- we will do a more complex operation (matrix multiplication)
- we create the matrices
import numpy as np
a = np.array([[5.0,5.0]]) #1by2
b = np.array([[2.0],[2.0]]) #2by1
- we pass them as tensorflow constants
mat1 = tf.constant(a)
mat2 = tf.constant(b)
- we can use the builtin tensorflow matrix multiplication function
matrix_multi = tf.matmul(mat1,mat2) - we run it in tensorflow session context
with tf.Session() as sess:
    result = sess.run(matrix_multi)
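- printing the result shows the product; with the 1x2 and 2x1 matrices above it is a 1x1 matrix:
print(result)  # [[ 20.]] since 5*2 + 5*2 = 20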
- WrapUp
- we can pass any object as a constant in tensorflow converting it to a tensor
- we can run a session in tensorflow passing a constant or an operation
- we will use the famous MNIST dataset of handwritten digits
- we import tensorflow
- we grab the dataset which is a builtin dataset of tensorflow
from tensorflow.examples.tutorials.mnist import input_data - we create a variable for the data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True) making a temp dir for the data - the data are handwritten digits (0-9)
type(mnist) reveals the type, which is a dataset object. mnist.train.images.shape gives the size of the train images: 55000 images of 784 values each - a single train image
mnist.train.images[2].shape is a 1d vector. if we reshape it to a 2d array of size 28 by 28 with sample = mnist.train.images[2].reshape(28,28) it is a matrix with the darkness stored as a 0-1 value - we can visualize it with matplotlib
plt.imshow(sample,cmap='Greys') using the imshow() method - with the dataset in memory we are now ready to define some params to be used by our multilayer perceptron DNN: learning rate, number of training epochs, batch size
learning_rate = 0.001 # how quickly we adjust the cost function
training_epochs = 15 # how many training cycles we go through
batch_size = 100 # size of batches of training data
- then we define network params that define how the multi-layer perceptron will look
n_classes = 10 # how many classes our DNN will output (num 0 -9)
n_samples = mnist.train.num_examples # sample count taken directly out of the dataset size
n_input = 784 # how we expect the input to our DNN to look (size of a sample)
n_hidden_1 = 256 # size of hidden layers of DNN (256 neurons is a typical number for image recognition for 8bit color storage)
n_hidden_2 = 256
- we will take the input and send it to the first hidden layer
- then the data will begin to have weights attached to it between layers (also we add bias)
- data will flow to the 2nd layer and then to the output
- more hidden layers => better results => more training time
- once the data comes to the output we need to evaluate it. there comes the reinforcement part
- we will use a loss function or cost function to evaluate how far we are from the desired results
- we will apply an optimization function trying to minimize the loss by adjusting the weights (we will use the adam optimizer)
- lower learning rate gives better results but longer exec time
- we will create a function for our multi-layer perceptron. it will take 3 arguments: input x, weights and biases. it will have 2 layers and use the RELU or rectifier activation function. for the final output layer we will use linear activation with matrix multiplication
- tf has a multitude of activation functions
def multilayer_perceptron(x, weights, biases):
    '''
    x: Placeholder for Data Input
    weights: Dict of weights
    biases: Dict of bias values
    '''
    # First Hidden Layer with RELU Activation Function
    # Step 1: X * W + B
    layer_1 = tf.add(tf.matmul(x, weights['h1']), biases['b1'])
    # Step 2: RELU(X * W + B) -> f(x) = max(0,x)
    layer_1 = tf.nn.relu(layer_1)
    # Second Hidden Layer
    layer_2 = tf.add(tf.matmul(layer_1, weights['h2']), biases['b2'])
    layer_2 = tf.nn.relu(layer_2)
    # Last Output Layer
    out_layer = tf.matmul(layer_2, weights['out']) + biases['out']
    return out_layer
- we will look into the dictionaries of weights and bias as we need to create them to pass them into the function. to define them we will use the tensorflow object type called Variable
- tensorflow has a Graph object aware of the states of all variables
- Variable is a modifiable Tensor that lives in tf Graph of interacting operations, modified by the computations. that's why we use Variables
- weights start as random and become adjusted during epochs (iterations)
- they are a matrix of input size by layer neuron count
weights = {
    'h1': tf.Variable(tf.random_normal([n_input, n_hidden_1])),  # random_normal outputs random nums from a normal distribution
    'h2': tf.Variable(tf.random_normal([n_hidden_1, n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_hidden_2, n_classes]))
}
- the output will have a 1 on the predicted class index, the rest will be 0
- we define biases in the same way. we can set them initially to 1 or set them random
biases = {
    'b1': tf.Variable(tf.random_normal([n_hidden_1])),
    'b2': tf.Variable(tf.random_normal([n_hidden_2])),
    'out': tf.Variable(tf.random_normal([n_classes]))
}
- we need to define a placeholder for x, setting the shape to None by n_input
x = tf.placeholder('float',[None,n_input]) - we do the same for the outputs y
y = tf.placeholder('float',[None,n_classes]) - we now have all the pieces in place so we can construct our model
pred = multilayer_perceptron(x,weights,biases)
- we will define our cost function and its optimizer (ADAM)
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(pred,y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)
- our model is ready so we can train it
- the next_batch(batch_size) method returns a tuple of 2 elements with the next batch of data
mnist.train.next_batch(10) for our dataset the array is 784 values long and 10 deep (num of samples) - the first element is the sample and the second is the target value (class)
- we will run the session as an InteractiveSession, not inside a with-block wrapper
sess = tf.InteractiveSession() - we initialize our variables
init = tf.initialize_all_variables() - we run our session
sess.run(init) - we now can do the actual training in epochs
# 15 training loops
for epoch in range(training_epochs):
    # running average of the cost across all batches in this epoch
    avg_cost = 0.0
    total_batch = int(n_samples/batch_size)
    for i in range(total_batch):
        batch_x, batch_y = mnist.train.next_batch(batch_size)
        # tuple unpacking with _ as we don't need the first value. c is the calculated loss
        _, c = sess.run([optimizer,cost], feed_dict={x:batch_x, y:batch_y})
        avg_cost += c/total_batch
    print("Epoch: {} cost: {:.4f}".format(epoch+1, avg_cost))
print("Model has completed {} Epochs of training".format(training_epochs))
- we run the training session. our avg cost drops in each iteration
- we can evaluate the model using tensorflow built in tools
- first we count the correct predictions doing an equality check
correct_predictions = tf.equal(tf.argmax(pred,1),tf.argmax(y,1)) comparing the predicted 1s to the actual 1s - this returns booleans so we need to cast them to numbers
correct_predictions = tf.cast(correct_predictions, 'float') - we get the average
accuracy = tf.reduce_mean(correct_predictions) which is a tensor object - we evaluate the result
accuracy.eval({x:mnist.test.images,y:mnist.test.labels}) - our accuracy is 94.4%
- raising the num of epochs can raise the accuracy to 99.9% (state of the art results)
- using TensorFlow with Sessions can get too complicated
- there is a community-based SciKit-Learn style interface to TensorFlow called SKFlow
- It became so popular that it is now officially part of TensorFlow, under Contrib
- we use it in jupyter notebook
- we import our dataset (IRIS)
from sklearn.datasets import load_iris and instantiate it iris = load_iris() - it is a dictionary. we check the keys
iris.keys() we will use only the data and target
X = iris['data']
y = iris['target']
- we split the data in train test sets using sklearn like before
- we build our model in normal sklearn style. first we import the lib
import tensorflow.contrib.learn.python.learn as learn - learn includes a multitude of tensorflow models available. we use DNNClassifier
classifier = learn.DNNClassifier(hidden_units=[10,20,10],n_classes=3) hidden_units is the num of nodes per layer, num of classes is self explanatory - we fit our classifier on our dataset adding DNN specific params like batch size and steps or epochs.
classifier.fit(X_train,y_train,steps=200,batch_size=32) - we get our predictions and evaluate them the normal way (using the confusion matrix and classification report, sketched below)
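- a sketch of that evaluation, assuming the train/test split from earlier; depending on the contrib.learn version predict() may return a generator, so we wrap it in list():
from sklearn.metrics import classification_report, confusion_matrix

iris_predictions = list(classifier.predict(X_test))
print(confusion_matrix(y_test, iris_predictions))
print(classification_report(y_test, iris_predictions))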
- We will do false bank note identification using UCI sample data
- we will prepare our data before using contrib.learn to classify them using DNN
- when our model evaluates extremely well we compare our results with the results from another model as a sanity check
- convert pandas dataframe to numpy array (matrix) to import it into tensorflow (tensorflow accepts numpy arrays)
array = df.as_matrix()