# Galvanize: Intro to Python for Data Science
- Wednesday, January 25, 2017
- Galvanize SF

-	Metis, Galvanize, General Assembly

- Itinerary
    - 6:00 pm - Networking & Announcements
    - 6:30 pm - Why Use Python for Data Science? 
    - 6:50 pm - Working with Python’s NumPy, SciPy 
    - 7:50 pm - Working with Pandas (DataFrames) 
    - 8:30 pm - Statistical Analysis in Python 
    - 9:00 pm - Wrap-Up and Additional Resources

## Introduction to Python for Data Analysis, Jared Thompson

-	Agenda
		- Why Python
		- Jupyter Notebook
		- Getting Started with Python
		- Common Data Science/Analysis Libs
		- Tricks
		- Where to Go Next
-	What to Expect?
		- Quick overview (Python / DS)
		- Introduction
		- Advanced stuff
		- Get outside of your comfort zone
		- Be patient
		- Practice at home
-	Why Python?
		- Scalability
		- Extensive [Data] Analytics Libs & Community
		- SciPy.org (Mathematics, Science, Engineering)
		- StatsModels (Statistics)
		- Pandas (DataFrames)
		- SciKit-Learn (ML)
		- Graphics
		- Libraries (ggplot, matplotlib)
		- APIs – Plot.ly
		- Easy to learn and code
-	What is Python?
		- An open-source, high-level, dynamic scripting language
		- Open source: free (both binaries and source files)
		- High-level: interpreted (add(a, b)); not compiled
		- Dynamic: Things that would happen at compile time happen at runtime instead
		- Dynamic Typing
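
A minimal sketch of what "dynamic" means in practice. The `add` function echoes the `add(a, b)` mentioned above; the values are illustrative only.

```python
# Dynamic typing: a name is bound to an object, and the type lives on the object.
x = 42
print(type(x).__name__)    # int

x = "forty-two"            # rebinding the same name to a str is allowed
print(type(x).__name__)    # str


def add(a, b):
    # No declared parameter types: any operands that support + will work.
    return a + b


print(add(1, 2))           # 3
print(add("py", "thon"))   # python

# Type errors surface at runtime, when the offending line actually executes.
try:
    add(1, "2")
except TypeError as exc:
    print("TypeError:", exc)
```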
-	Strengths and Weaknesses
		- Strengths
		- Python's popularity comes from the strength of its design
		- BDFL (benevolent dictator for life): Guido van Rossum
		- Has unified design philosophy
		- Emphasizes reusability and readability of code
		- Weaknesses
		- Slower than lower-level languages
-	Python, how?
		- Anaconda Project (http://continuum.io/anaconda)
		- IPython Notebook (http://ipython.org)
-	Other Tools
		- Julia
		- R
		- Matlab
		- Stata, SPSS, SAS
-	Books
		- Python
		- Think Python
		- Python for Data Analysis
-	Python in the Data Science Workflow
-	What is Data Science/Data Exploration?
		- Extract useful information and knowledge from data
		- Clarify info from large data
		- Improve decision-making
		- Tell compelling stories
-	Python as a tool for Data Science
		- Diagram: Doing Data Science // O’Reilly 2014 
-	Data Preparation
		- The goal of “Pre-Processing” is to convert data into a standard format
		- A standard format allows the data to be used as input to algorithms
-	Analysis/Model
		- Different algorithms for optimal choice
		- SciPy
		- Scikit-learn
		- StatsModels
		- Built as “black boxes”
-	Visualization
		- Since seeing is believing…
		- Prettyplotlib
-	Data Science Library Introduction
		- NumPy: Vectors, arrays, matrices, tensors (arrays with three or more dimensions)
		- Matplotlib
		- Scipy
		- Pandas
		- Scikit-learn
		- StatsModels
-	Basic Data Structures
		- Array, Lists => NumPy Array

In [94]:
import numpy as np
# slicing
newarray = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(newarray[4:-1])
print(newarray[:3])
print(newarray[-1])
print(newarray[-2:])

[5, 6, 7, 8]
[1, 2, 3]
9
[8, 9]


		- Tuples => immutable sequences of arbitrary elements
		- String
		- Associative Arrays (hash tables)
		- Sets
		- Control Flow
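
The remaining built-in structures listed above can be sketched in a few lines. All names and values here are made up for illustration.

```python
# Tuple: an immutable sequence of arbitrary elements.
point = (3, "red", 4.5)
try:
    point[0] = 10
except TypeError:
    print("tuples cannot be modified in place")

# String: an immutable sequence of characters that slices like a list.
word = "galvanize"
print(word[:3], word[-3:])   # gal ize

# Dictionary: Python's associative array, implemented as a hash table.
ages = {"ann": 31, "bob": 27}
ages["cy"] = 45
print(ages["bob"])           # 27

# Set: an unordered collection of unique elements.
evens = {2, 4, 6, 8}
squares = {1, 4, 9}
print(evens & squares)       # {4}

# Control flow: a for loop with if / elif / else.
labels = []
for n in range(1, 6):
    if n % 2 == 0:
        labels.append("even")
    elif n == 1:
        labels.append("one")
    else:
        labels.append("odd")
print(labels)                # ['one', 'even', 'odd', 'even', 'odd']
```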
-	NumPy

In [95]:
# numpy
np.zeros((5,5))

array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

In [96]:
np.ones((5,5))

array([[ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.]])

		- Matrix multiplication

In [97]:
# matrix mul
nr = newarray[:5]
np.ones((5,5)).dot(nr)

array([ 15.,  15.,  15.,  15.,  15.])

		- Matrix object
		- Ndarray element-wise operations 
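
To make the difference between element-wise operations and matrix multiplication concrete, here is a small sketch. The arrays `a` and `b` are illustrative and not from the talk.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[10, 20], [30, 40]])

# Element-wise: each entry combines with the entry in the same position.
print((a + b).tolist())     # [[11, 22], [33, 44]]
print((a * b).tolist())     # [[10, 40], [90, 160]]
print((a * 3).tolist())     # [[3, 6], [9, 12]]  (the scalar is broadcast)

# Matrix product: rows of a combined with columns of b.
print(a.dot(b).tolist())    # [[70, 100], [150, 220]]
print((a @ b).tolist())     # same result using the @ operator
```

On a plain ndarray, `*` is always element-wise; use `.dot()` or `@` for the matrix product.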
-	Pandas
		- “Open-source library providing high-performance data structures and data analysis tools”
		- For bigger data, not gigantic (that’s for Spark)
		- Examples

In [98]:
# pandas
import pandas as pd
import numpy as np
s = pd.Series(np.random.randn(3), index=['a', 'b', 'c'])
s

a   -0.583679
b    0.014482
c   -1.029951
dtype: float64

In [99]:
s['a']

-0.58367861193488446

In [100]:
s.iloc[0]  # positional access; plain s[0] is ambiguous on a label index

-0.58367861193488446

In [101]:
d = {'a' : 0, 'b': 1, 'c': 2}
pd.Series(d)

a    0
b    1
c    2
dtype: int64

In [102]:
# "trick": in IPython/Jupyter, a trailing ? shows the help for an object
pd.Series?

In [104]:
s = pd.Series(np.arange(4), index=['a', 'b', 'c', 'd'])
s + s

a    0
b    2
c    4
d    6
dtype: int32

In [105]:
t = pd.Series([30,40], index=['b', 'c'])
t

b    30
c    40
dtype: int64

In [106]:
s + t

a     NaN
b    31.0
c    42.0
d     NaN
dtype: float64

		- Typical dataset sizes for Pandas: all the marketing data for the last 2 years, or a year of chemical engineering data
		- Series – 1D labeled array
		- Series indexed by a series of labels
		- Can be addressed or sliced like an array
		- Somewhat akin to dictionaries and easy to convert a dict to a series
		- “Trick”/Help(!)
		- Provides vector operation support and index alignment
		- DataFrame => 2D table

In [107]:
# DataFrame
d = {'one' : [10., 20., 30., 40.], 'two' : [4., 3., 2., 1.]}
d

{'one': [10.0, 20.0, 30.0, 40.0], 'two': [4.0, 3.0, 2.0, 1.0]}
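
The dictionary `d` above is not turned into a DataFrame in the notes. A minimal sketch of that step follows; the row labels `'a'` to `'d'` and the name `df_d` are assumptions added for illustration.

```python
import pandas as pd

d = {'one': [10., 20., 30., 40.], 'two': [4., 3., 2., 1.]}

# Each dict key becomes a column; the index labels are assumed for this example.
df_d = pd.DataFrame(d, index=['a', 'b', 'c', 'd'])
print(df_d)
print(df_d.shape)            # (4, 2): 4 rows, 2 columns

# Selecting one column returns a Series that shares the DataFrame's index.
print(df_d['one'])
print(df_d.loc['b', 'two'])  # 3.0
```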

In [108]:
df = pd.DataFrame(np.floor(np.random.randn(3,4)*10), columns=['A', 'B', 'C', 'D'])
df  # np.random.randn(3, 4) -> (rows, columns); np.floor rounds down, toward negative infinity

      A     B      C     D
0  -2.0   7.0   -4.0  11.0
1  -1.0   5.0   -2.0   9.0
2   5.0  11.0  -12.0  -8.0


In [109]:
df2 = pd.DataFrame(np.floor(np.random.randn(3,2)*10), columns=['A', 'B'])
df2

       A     B
0  -12.0  -9.0
1  -18.0   0.0
2    6.0  13.0


In [110]:
# creating a dataframe
import pandas as pd
import numpy as np
df = pd.DataFrame({'int_col' : [1, 2, 6, 8, -1],
                   'float_col' : [0.1, 0.2, 0.2, 10.1, None],
                   'str_col' : ['a', 'b', None, 'c', 'a']})
df

   float_col  int_col str_col
0        0.1        1       a
1        0.2        2       b
2        0.2        6    None
3       10.1        8       c
4        NaN       -1       a


		- Some notable features
		- Data loading
		- Data selection using indexes
		- Group
		- Indexing

In [112]:
# indexing
df.loc[:, ['float_col', 'int_col']]  # .loc[rows, cols]: label-based data extraction

   float_col  int_col
0        0.1        1
1        0.2        2
2        0.2        6
3       10.1        8
4        NaN       -1


In [113]:
df[['float_col', 'int_col']] # another way

   float_col  int_col
0        0.1        1
1        0.2        2
2        0.2        6
3       10.1        8
4        NaN       -1


		- Conditional Indexing

In [114]:
# conditional indexing
df['float_col'] > 0.15

0    False
1     True
2     True
3     True
4    False
Name: float_col, dtype: bool

In [115]:
df[df['float_col'] > 0.15]

   float_col  int_col str_col
1        0.2        2       b
2        0.2        6    None
3       10.1        8       c


In [116]:
arg1 = df['float_col'] > 0.1
arg2 = df['int_col'] > 2
# df[(df['float_col'] > 0.1) & (df['int_col'] > 2)]
df[arg1 & arg2] # and

   float_col  int_col str_col
2        0.2        6    None
3       10.1        8       c


In [117]:
# df[(df['float_col'] > 0.1) | (df['int_col'] > 2)]
df[arg1 | arg2] # or

   float_col  int_col str_col
1        0.2        2       b
2        0.2        6    None
3       10.1        8       c


In [118]:
# df[~(df['float_col'] > 0.1)]
df[~arg1] # invert

   float_col  int_col str_col
0        0.1        1       a
4        NaN       -1       a


In [119]:
# vectorized math operations
df = pd.DataFrame(data = {"A": [1,2], "B": [1.2, 1.3]})
df["C"] = df["A"] + df["B"]
df

   A    B    C
0  1  1.2  2.2
1  2  1.3  3.3


In [120]:
df["D"] = df["A"]*3
df

   A    B    C  D
0  1  1.2  2.2  3
1  2  1.3  3.3  6


In [121]:
df["E"] = np.sqrt(df["A"])
df

   A    B    C  D         E
0  1  1.2  2.2  3  1.000000
1  2  1.3  3.3  6  1.414214


		- Column Renaming
		- Remove Missing Values => .dropna()
		- Replace Missing Values => .fillna()
		- Map & Apply Functions
•	Map => .map(lambda x: 'map_' + x)
•	Apply => .apply(np.sqrt)
		- Applymap => .applymap()
		- Vectorized String Operations
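
The cleaning and transformation features listed above have no cells in these notes, so here is a hedged sketch. It rebuilds the earlier `int_col` / `float_col` / `str_col` frame under the assumed name `df3`; the replacement values and the new column name are illustrative. `.applymap()` is left out because recent pandas releases deprecate it in favor of `DataFrame.map`.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame({'int_col': [1, 2, 6, 8, -1],
                    'float_col': [0.1, 0.2, 0.2, 10.1, None],
                    'str_col': ['a', 'b', None, 'c', 'a']})

# Column renaming: returns a new DataFrame and leaves df3 unchanged.
renamed = df3.rename(columns={'int_col': 'some_other_name'})

# Remove every row that contains at least one missing value.
complete_rows = df3.dropna()

# Replace missing values, with a different filler per column.
filled = df3.fillna(value={'float_col': 0.0, 'str_col': 'missing'})

# map works element by element on a Series.
mapped = df3['str_col'].dropna().map(lambda x: 'map_' + x)

# apply calls a function on each column of a DataFrame (or each row with axis=1).
# abs() is applied first so that np.sqrt never sees a negative number.
roots = df3[['int_col', 'float_col']].abs().apply(np.sqrt)

# Vectorized string operations live under the .str accessor and skip missing values.
upper = df3['str_col'].str.upper()

print(renamed.columns.tolist())
print(len(complete_rows))    # 3
print(mapped.tolist())       # ['map_a', 'map_b', 'map_c', 'map_a']
```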
		- Groupby

In [122]:
# groupby
df4 = pd.DataFrame({'int_col' : [1, 2, 6, 8, -1],
                   'float_col' : [0.1, 0.2, 0.5, 10.1, 0.1],
                   'str_col' : ['a', 'b', 'b', 'c', 'a']})
grouped = df4['float_col'].groupby(df4['str_col'])
grouped.mean()

str_col
a     0.10
b     0.35
c    10.10
Name: float_col, dtype: float64

In [123]:
g2 = df4.groupby(['float_col', 'str_col'])
g2.sum()

                   int_col
float_col str_col
0.1       a              0
0.2       b              2
0.5       b              6
10.1      c              8


		- Statistics
•	Covariance Tables
•	Correlation Tables
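
A small sketch of the covariance and correlation tables, using a made-up frame (`stats_df` and its values are assumptions, chosen so the results are easy to check by hand).

```python
import pandas as pd

stats_df = pd.DataFrame({'x': [1.0, 2.0, 3.0, 4.0],
                         'y': [2.0, 4.0, 6.0, 8.0],   # y = 2x, perfectly correlated with x
                         'z': [4.0, 3.0, 2.0, 1.0]})  # z falls as x rises

# Pairwise sample covariance between all numeric columns.
cov_table = stats_df.cov()

# Pairwise Pearson correlation, always between -1 and 1.
corr_table = stats_df.corr()

print(cov_table)
print(corr_table)

# describe() gives count, mean, std, min, quartiles, and max per column.
print(stats_df.describe())
```

Here `corr_table.loc['x', 'y']` comes out as 1 and `corr_table.loc['x', 'z']` as -1, up to floating-point rounding.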
		- Merge and Join
•	Merge => inner join on shared columns by default (`how` selects left, right, or outer)
•	Join => left join on the index by default
•	Mount one table to another so that you have 2 tables linked together based on their labels
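
A minimal sketch of merging and joining two small tables. The frames `left` and `right` and their column names are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'left_val': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'right_val': [20, 30, 40]})

# merge: the default how='inner' keeps only keys present in both tables (b, c).
inner = pd.merge(left, right, on='key')

# how='outer' keeps every key from either table and fills the gaps with NaN.
outer = pd.merge(left, right, on='key', how='outer')

# join: aligns on the index and defaults to a left join (keeps a, b, c).
joined = left.set_index('key').join(right.set_index('key'))

print(inner)
print(outer)
print(joined)
```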
		- Basic Plotting: Lines => df.plot()
		- Histograms => df.hist()
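
A hedged plotting sketch with random data. The `Agg` backend line is an assumption so the script runs without a display; inside Jupyter you would typically drop it and use `%matplotlib inline` instead. The frame `plot_df` is illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window needed

import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # fixed seed so the figures are reproducible

# Two random walks, one per column.
plot_df = pd.DataFrame(rng.randn(100, 2).cumsum(axis=0), columns=['A', 'B'])

# Line plot: one line per column, with the index on the x-axis.
ax = plot_df.plot()

# Histograms: one subplot per column, returned as an array of Axes.
axes = plot_df.hist(bins=20)

print(len(ax.get_lines()))   # 2
```

Outside a notebook, `ax.figure.savefig(...)` writes the figure to a file.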
-	Resources & Next Steps
		- After Jupyter => PyCharm
		- http://cli.learncodethehardway.org
		- http://learnpythonthehardway.org
		- http://khanacademy.org/math/probability/regression
		- Anaconda
		- http://wakari.io => Online IDE
		- Google for Education => https://developers.google.com/edu/python/set-up
-	Question
		- You want to find out how much you weigh in stone. A concise program can make short work of this task. Since a stone is 14 pounds and there are about 2.2 pounds in a kilogram, the formula is:
		- $m_{\text{stone}} = (m_{\text{kg}} \times 2.2) / 14$
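
One possible solution sketch for this exercise. The function and constant names are made up, and 70 kg is just a sample input.

```python
LB_PER_KG = 2.2      # approximate pounds per kilogram, as given in the question
LB_PER_STONE = 14    # pounds per stone


def kg_to_stone(mass_kg):
    # Convert kilograms to pounds, then pounds to stone.
    return mass_kg * LB_PER_KG / LB_PER_STONE


print(round(kg_to_stone(70), 2))   # 11.0
```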
-	Question
		- Find all numbers divisible by 7 but not a multiple of 5, between 2000 and 3200
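
One possible solution sketch using a list comprehension. It assumes both bounds, 2000 and 3200, are inclusive, which the question does not state.

```python
# Keep n when it is divisible by 7 and not a multiple of 5.
# range() excludes its stop value, so 3201 makes 3200 inclusive (assumption).
matches = [n for n in range(2000, 3201) if n % 7 == 0 and n % 5 != 0]

print(matches[:5])    # [2002, 2009, 2016, 2023, 2037]
print(len(matches))   # 138
```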
