# Python Libraries

Commonly used Python libraries:

* **NumPy (np)** a library for working with arrays of data

* **Pandas (pd)** provides high-performance, easy-to-use data structures and data analysis tools

* **SciPy** a library of techniques for numerical and scientific computing (only import specific functions)

* **Matplotlib.pyplot (plt)** plotting library for making graphs (only import the **pyplot** module)

* **Seaborn (sns)** a higher-level interface to Matplotlib that can be used to simplify many graphing tasks

* **Statsmodels.api (sm)** provides classes and functions for the estimation of many different statistical models

* **scikit-learn** a machine learning library with many of the same statistical techniques as Statsmodels (only import specific functions)
  
  
NOTE: libraries can be imported in their entirety (with an alias) or just a portion of the library can be imported.  
* Importing a *library*: **import** numpy as np
* Importing a *module*: **import** matplotlib.pyplot as plt
* Importing a *function*: **from** numpy **import** arange  
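
The three import styles can be compared side by side. The following is a minimal sketch that assumes only NumPy is installed; `numpy.linalg` stands in for the module import, and the same pattern applies to `import matplotlib.pyplot as plt`.

```python
# Import an entire library under an alias
import numpy as np

# Import a single module of a library under an alias
import numpy.linalg as la

# Import one function directly into the current namespace
from numpy import arange

print(np.arange(3))          # function called through the library alias
print(arange(3))             # same function called directly
print(arange is np.arange)   # both names point at the same function object
print(la.norm([3, 4]))       # function called through the module alias
```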


### Commonly used sample datasets

Seaborn sample datasets  
https://github.com/mwaskom/seaborn-data  
(used as a target for the seaborn.load_dataset function)

scikit-learn sample datasets (need to be converted into dataframes)  
https://scikit-learn.org/stable/datasets/index.html#toy-datasets  
(note: scikit-learn dataset objects have separate "data", "feature_names", and "target")



### Commonly used functions  

* **pd.read_csv** Pandas function to read *.csv files
* **df.describe()** produces descriptive stats on the numeric variables in a dataframe named "df"
* **df.head()** shows the top of the dataframe




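The following code loads the Boston housing data from scikit-learn and creates a Pandas dataframe. Note that `load_boston` was removed in scikit-learn 1.2, so this cell only runs on earlier releases.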

In [1]:
# Converting a scikit-learn sample dataset to a Pandas dataframe
# NOTE: scikit-learn sample datasets are a "bunch"
#    that includes "data", "feature_names", and "target"
import pandas as pd
from sklearn.datasets import load_boston
boston_data = load_boston()
#type(load_boston)   # type is: function (to create a sklearn "bunch")
#type(boston_data)   # type is: sklearn.utils.Bunch
df_boston = pd.DataFrame(boston_data.data,columns=boston_data.feature_names)
df_boston['target'] = pd.Series(boston_data.target)  # appends a new column

df_boston.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target
0,0.00632,18.0,2.31,0.0,0.538,6.575,65.2,4.09,1.0,296.0,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,242.0,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0.0,0.458,6.998,45.8,6.0622,3.0,222.0,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0.0,0.458,7.147,54.2,6.0622,3.0,222.0,18.7,396.9,5.33,36.2
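
The same Bunch-to-dataframe conversion works for other scikit-learn toy datasets. The sketch below uses `load_iris` as a stand-in (an assumption made here because recent scikit-learn releases no longer ship the Boston data); the steps mirror the cell above.

```python
import pandas as pd
from sklearn.datasets import load_iris

# load_iris returns a Bunch with "data", "feature_names", and "target"
iris = load_iris()

# Build the dataframe from the feature matrix and the feature names
df_iris = pd.DataFrame(iris.data, columns=iris.feature_names)

# Append the target as a new column
df_iris['target'] = iris.target

print(df_iris.shape)
print(df_iris.head())
```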


In [2]:
# Vega Datasets from Jake Vanderplas, author of Python Data Science Handbook
# https://cmdlinetips.com/2018/04/vega_datasets-a-python-package-for-datasets/

# install vega_datasets
# !pip install vega_datasets

# import vega_datasets
from vega_datasets import data
# check to see the list of data sets
data.list_datasets()

['7zip',
 'airports',
 'annual-precip',
 'anscombe',
 'barley',
 'birdstrikes',
 'budget',
 'budgets',
 'burtin',
 'cars',
 'climate',
 'co2-concentration',
 'countries',
 'crimea',
 'disasters',
 'driving',
 'earthquakes',
 'ffox',
 'flare',
 'flare-dependencies',
 'flights-10k',
 'flights-200k',
 'flights-20k',
 'flights-2k',
 'flights-3m',
 'flights-5k',
 'flights-airport',
 'gapminder',
 'gapminder-health-income',
 'gimp',
 'github',
 'graticule',
 'income',
 'iowa-electricity',
 'iris',
 'jobs',
 'la-riots',
 'londonBoroughs',
 'londonCentroids',
 'londonTubeLines',
 'lookup_groups',
 'lookup_people',
 'miserables',
 'monarchs',
 'movies',
 'normal-2d',
 'obesity',
 'ohlc',
 'points',
 'population',
 'population_engineers_hurricanes',
 'seattle-temps',
 'seattle-weather',
 'sf-temps',
 'sp500',
 'stocks',
 'udistrict',
 'unemployment',
 'unemployment-across-industries',
 'uniform-2d',
 'us-10m',
 'us-employment',
 'us-state-capitals',
 'volcano',
 'weather',
 'weball26',
 'wheat',

In [3]:
gapminder = data.gapminder()
gapminder.head()

Unnamed: 0,year,country,cluster,pop,life_expect,fertility
0,1955,Afghanistan,0,8891209,30.332,7.7
1,1960,Afghanistan,0,9829450,31.997,7.7
2,1965,Afghanistan,0,10997885,34.02,7.7
3,1970,Afghanistan,0,12430623,36.088,7.7
4,1975,Afghanistan,0,14132019,38.438,7.7


In [4]:
import pandas as pd
#df1=pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
#df1=pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv')
df=pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv')
df.head()


Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino


In [5]:
# Check the data types
df.dtypes


mpg             float64
cylinders         int64
displacement    float64
horsepower      float64
weight            int64
acceleration    float64
model_year        int64
origin           object
name             object
dtype: object

In [6]:
# Get dataframe shape (number of rows & columns)
df.shape

(398, 9)

In [7]:
# Get the number of rows in the dataframe (runs quicker than the shape attribute)
len(df.index)

398

In [8]:
# Get summary statistics on all numeric (int & float) variables
df.describe()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year
count,398.0,398.0,398.0,392.0,398.0,398.0,398.0
mean,23.514573,5.454774,193.425879,104.469388,2970.424623,15.56809,76.01005
std,7.815984,1.701004,104.269838,38.49116,846.841774,2.757689,3.697627
min,9.0,3.0,68.0,46.0,1613.0,8.0,70.0
25%,17.5,4.0,104.25,75.0,2223.75,13.825,73.0
50%,23.0,4.0,148.5,93.5,2803.5,15.5,76.0
75%,29.0,8.0,262.0,126.0,3608.0,17.175,79.0
max,46.6,8.0,455.0,230.0,5140.0,24.8,82.0


### Do cars from different origins differ from each other?  
1. group the data frame by origin
2. compare means for the different origins


In [9]:
# Q: How many unique values does origin have?
#df['origin'].nunique()        # Count distinct values
df['origin'].value_counts()    # Produces a simple frequency table

usa       249
japan      79
europe     70
Name: origin, dtype: int64

In [10]:
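# group the data on origin and average the numeric columns
# (numeric_only=True skips text columns such as "name", which newer pandas versions refuse to average)
df.groupby('origin').mean(numeric_only=True)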

origin,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year
europe,27.891429,4.157143,109.142857,80.558824,2423.3,16.787143,75.814286
japan,30.450633,4.101266,102.708861,79.835443,2221.227848,16.172152,77.443038
usa,20.083534,6.248996,245.901606,119.04898,3361.931727,15.033735,75.610442


### Do car brands differ from each other?  
1. extract the car brand from the "name" field  
2. group the data frame by car brand  
3. compare means for the different car brands  


In [11]:
# Extract the string of text up to the first space (' ')
# https://stackoverflow.com/questions/45019319/pandas-split-a-string-and-then-create-a-new-column

df['brand'] = df['name'].str.split(' ').str[0]
df['brand'].value_counts()

# NOTE: this "brand" variable needs some data cleaning


ford             51
chevrolet        43
plymouth         31
dodge            28
amc              28
toyota           25
datsun           23
buick            17
pontiac          16
volkswagen       15
honda            13
mercury          11
oldsmobile       10
mazda            10
peugeot           8
fiat              8
audi              7
chrysler          6
vw                6
volvo             6
renault           5
opel              4
saab              4
subaru            4
chevy             3
cadillac          2
maxda             2
bmw               2
mercedes-benz     2
hi                1
toyouta           1
chevroelt         1
capri             1
nissan            1
mercedes          1
triumph           1
vokswagen         1
Name: brand, dtype: int64
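
One way to do the data cleaning mentioned in the note is to map variant spellings to a single canonical name. The sketch below runs on a small stand-in Series rather than the full dataframe. The `brand_fixes` mapping is a hypothetical one inferred from the frequency table above, and ambiguous values such as "hi" and "capri" are left out of it.

```python
import pandas as pd

# A small stand-in for df['brand'], using spellings from the frequency table above
brand = pd.Series(['ford', 'chevy', 'chevroelt', 'chevrolet', 'vw',
                   'vokswagen', 'maxda', 'toyouta', 'mercedes'])

# Hypothetical mapping from variant spellings to a canonical brand name
brand_fixes = {
    'chevy': 'chevrolet',
    'chevroelt': 'chevrolet',
    'vw': 'volkswagen',
    'vokswagen': 'volkswagen',
    'maxda': 'mazda',
    'toyouta': 'toyota',
    'mercedes': 'mercedes-benz',
}

# Series.replace swaps exact matches and leaves all other values untouched
brand_clean = brand.replace(brand_fixes)
print(brand_clean.value_counts())
```

On the real data the same call would be `df['brand'] = df['brand'].replace(brand_fixes)`.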

In [12]:
import numpy as np
import pandas as pd
import random

pd.__version__   # check Pandas version number
#pd.show_versions(as_json=False)   # check the versions of pandas' dependencies

'1.0.1'

In [13]:
# Generating fake data for demo purposes

# list of names 
list_names=['Jake Barnes', 'Quincy Wagstaff', 'James Bond', 'Emma Bovary',
    "Scarlett O’Hara", 'Holden Caulfield', 'Ichabod Crane', 'Eliza Doolittle',
    'Nancy Drew', 'Frodo Baggins', 'Edwin Drood', 'Sam Spade', 'Dorian Gray',
    'Mike Hammer', 'Matt Helm', 'Stanley Kowalski', 'Willy Loman', 'Philip Marlowe',
    'Perry Mason', 'Walter Mitty', 'Hercule Poirot', 'Mary Poppins',
    'Christopher Robin', 'Tom Sawyer', 'Becky Sharp', 'Anne Shirley',
    'Becky Thatcher', 'Emma Woodhouse', 'Harry Flashman', 'Elizabeth Bennett',
    'Anne Elliot', 'Jane Eyre', 'Moll Flanders', 'John Yossarian', 'Atticus Finch',
    'Jay Gatsby', 'Lois Lane', 'Victor Frankenstein', 'Veruca Salt', 'Marion Crane',
    'Holly Golightly', 'Rufus Firefly', 'Bilbo Baggins', 'Otis Driftwood',
    'Hugo Hackenbush', 'Charlie McCarthy', 'Mary Richards']

# Calling DataFrame constructor on list 
df = pd.DataFrame(list_names, columns=['Names'])

# Generate fake dates
np.random.seed(1)
df['temp_month'] = np.random.randint(1, 13, df.shape[0])
df['temp_yr'] = np.random.randint(1960, 1991, df.shape[0])
df['temp_day'] = np.random.randint(1, 29, df.shape[0])

df


Unnamed: 0,Names,temp_month,temp_yr,temp_day
0,Jake Barnes,6,1985,8
1,Quincy Wagstaff,12,1963,20
2,James Bond,9,1964,11
3,Emma Bovary,10,1984,15
4,Scarlett O’Hara,12,1977,1
5,Holden Caulfield,6,1971,25
6,Ichabod Crane,1,1972,24
7,Eliza Doolittle,1,1986,24
8,Nancy Drew,2,1980,2
9,Frodo Baggins,8,1976,18


In [14]:
# Need to figure out looping and conditional compute (if?)

# NOTE: each pass through the loop draws one random number and assigns it to the whole column,
#    so every row ends up with the value drawn on the final pass.
for index, row in df.iterrows(): 
    df['temp_day2'] = np.random.randint(1, 29)
df.head()



Unnamed: 0,Names,temp_month,temp_yr,temp_day,temp_day2
0,Jake Barnes,6,1985,8,22
1,Quincy Wagstaff,12,1963,20,22
2,James Bond,9,1964,11,22
3,Emma Bovary,10,1984,15,22
4,Scarlett O’Hara,12,1977,1,22
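
A loop is usually not needed for either task. The sketch below uses a hypothetical five-row dataframe (`df_demo` and the `season` column are illustrative names, not part of the notebook) to show one random draw per row and a conditional column computed with `np.where`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data
df_demo = pd.DataFrame({'temp_month': [2, 6, 12, 2, 9]})

# One draw per row: pass the number of rows as the size argument
np.random.seed(1)
df_demo['temp_day2'] = np.random.randint(1, 29, size=len(df_demo))

# Conditional compute without a loop:
# np.where(condition, value_if_true, value_if_false)
is_winter = df_demo['temp_month'].isin([12, 1, 2])
df_demo['season'] = np.where(is_winter, 'winter', 'other')

print(df_demo)
```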


In [15]:
#df['Feb']= (df['temp_month'] == 2)    # creates a boolean
#df
df['temp_day'] = np.random.randint(1, 29, df.shape[0])
df.head()


Unnamed: 0,Names,temp_month,temp_yr,temp_day,temp_day2
0,Jake Barnes,6,1985,8,22
1,Quincy Wagstaff,12,1963,1,22
2,James Bond,9,1964,26,22
3,Emma Bovary,10,1984,20,22
4,Scarlett O’Hara,12,1977,26,22


In [None]:
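# NOTE: this cell assumes a dataframe with a 'Team' column; the demo dataframe above does not have one
# applying groupby() function to 
# group the data on team value. 
gk = df.groupby('Team') 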
  
# Let's print the first entries 
# in all the groups formed. 
gk.first() 



In [None]:
# Finding the values contained in the "Boston Celtics" group 
gk.get_group('Boston Celtics') 



In [None]:
# NOTE: assumes a dataframe with an 'Animal' column
df.groupby(['Animal']).mean()

One term that’s frequently used alongside .groupby() is **split-apply-combine**. This refers to a chain of three steps:

1. **Split** a table into groups
2. **Apply** some operations to each of those smaller tables
3. **Combine** the results
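
The three steps can be written out by hand and compared with the one-line `groupby` form. This is a sketch on hypothetical toy data (`df_cars` and its values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data
df_cars = pd.DataFrame({
    'origin': ['usa', 'usa', 'japan', 'japan', 'europe'],
    'mpg':    [18.0, 15.0, 24.0, 27.0, 26.0],
})

# Split: one sub-table per origin
pieces = {key: group for key, group in df_cars.groupby('origin')}

# Apply: compute the mean mpg of each sub-table
means = {key: group['mpg'].mean() for key, group in pieces.items()}

# Combine: assemble the per-group results into one Series
manual = pd.Series(means, name='mpg').sort_index()

# groupby performs the same three steps in one expression
shortcut = df_cars.groupby('origin')['mpg'].mean()

print(manual)
print(shortcut)
```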


In [None]:
# For a sorted dataframe, compute pct_change from previous row
#pct_change = close[1:]/close[:-1]

# import vega_datasets
from vega_datasets import data

# check to see the list of data sets
#data.list_datasets()

gapminder = data.gapminder()
gapminder.head()

#pct_change = close[1:]/close[:-1]
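
The commented-out expression `close[1:]/close[:-1]` gives the ratio to the previous element for a NumPy array; for a pandas Series the two slices are aligned on their index labels first, so it does not compare neighbouring rows. Pandas has a `pct_change()` method for this. The sketch below uses a hypothetical stand-in for a few gapminder rows and restarts the calculation at each country.

```python
import pandas as pd

# Hypothetical stand-in for a few gapminder rows
df_pop = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B', 'B', 'B'],
    'year':    [1955, 1960, 1965, 1955, 1960, 1965],
    'pop':     [100, 110, 121, 200, 220, 198],
})

# Sort so that each row's predecessor is the previous period for the same country
df_pop = df_pop.sort_values(['country', 'year'])

# Fractional change from the previous row, computed separately per country;
# the first row of each country has no predecessor and is NaN
df_pop['pop_pct_change'] = df_pop.groupby('country')['pop'].pct_change()

print(df_pop)
```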


NOTE: From this point forward, this notebook needs to be re-written

### Utilizing Library Functions

After importing a library, its functions can then be called from your code by prepending the library name to the function name.  For example, to use the '`dot`' function from the '`numpy`' library, you would enter '`numpy.dot`'.  To avoid repeatedly having to type the library name in your scripts, it is conventional to define a two- or three-letter abbreviation for each library, e.g. '`numpy`' is usually abbreviated as '`np`'.  This allows us to use '`np.dot`' instead of '`numpy.dot`'.  Similarly, the Pandas library is typically abbreviated as '`pd`'.

The next cell shows how to call functions within an imported library:

In [None]:
a = np.array([[1,2],[3,4]]) 
b = np.array([[11,12],[13,14]]) 

np.dot(a,b)

As you can see, we used the dot() function from the numpy library to calculate the product of two arrays, a and b.  Because both arrays are 2-dimensional, np.dot performs matrix multiplication here.
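
To make the calculation concrete, the sketch below repeats the cell with its import, works out one entry by hand, and checks the result against the `@` operator.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[11, 12], [13, 14]])

# For 2-D arrays, np.dot is matrix multiplication: entry (i, j) is the
# dot product of row i of a with column j of b
product = np.dot(a, b)
print(product)

# Top-left entry computed by hand: 1*11 + 2*13
top_left = a[0, 0] * b[0, 0] + a[0, 1] * b[1, 0]
print(top_left)

# The @ operator gives the same result for 2-D arrays
print(np.array_equal(product, a @ b))
```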

# Data Management

Data management is a crucial component to statistical analysis and data science work.  The following code will show how to import data via the pandas library, view your data, and transform your data.

The main data structure that Pandas works with is called a **Data Frame**.  This is a two-dimensional table of data in which the rows typically represent cases (e.g. Cartwheel Contest Participants), and the columns represent variables.  Pandas also has a one-dimensional data structure called a **Series** that we will encounter when accessing a single column of a Data Frame.

Pandas has a variety of functions named '`read_xxx`' for reading data in different formats.  Right now we will focus on reading '`csv`' files ('`csv`' stands for comma-separated values).  Other supported formats include Excel, JSON, and SQL, to name a few.

There are many other options to '`read_csv`' that are very useful.  For example, you would use the option `sep='\t'` instead of the default `sep=','` if the fields of your data file are delimited by tabs instead of commas.  See [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) for the full documentation for '`read_csv`'.
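
As a small illustration of the `sep` option, the sketch below reads tab-delimited text. The data is hypothetical and is held in memory with `io.StringIO` so the example runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical tab-delimited data held in memory instead of a file on disk
tsv_text = "ID\tAge\tHeight\n1\t56\t62.0\n2\t26\t62.0\n3\t33\t66.0\n"

# sep='\t' tells read_csv to split fields on tabs rather than commas
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(df_tsv)
print(df_tsv.dtypes)
```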

### Importing Data

In [None]:
# Store the path to our .csv file (a URL string also works)
url = "Cartwheeldata.csv"

# Read the .csv file and store it as a pandas Data Frame
df = pd.read_csv(url)

# Output object type
type(df)

### Viewing Data

In [None]:
# We can view our Data Frame by calling the head() function
df.head()

The head() function shows the first 5 rows of our Data Frame by default.  If we wanted to show the entire Data Frame we would simply write the following:

In [None]:
# Output entire Data Frame
df

As you can see, we have a 2-Dimensional object where each row is an independent observation of our cartwheel data.

To gather more information regarding the data, we can view the column names with the columns attribute (data types are covered further below):

In [None]:
df.columns

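Let's say we would like to slice our data frame and select only specific portions of our data.  Pandas has offered three different indexers for doing so.

1. .loc()
2. .iloc()
3. .ix()

We will cover the .loc() and .iloc() indexers.  (.ix() was deprecated and then removed in pandas 1.0, so it is not available in current versions.)  Although they are written with parentheses in the text here, both are used with square brackets, e.g. df.loc[...].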

### .loc()
.loc() takes up to two selectors separated by ','; each can be a single label, a list of labels, or a slice. The first one selects rows and the second one selects columns.

In [None]:
# Return all observations of CWDistance
df.loc[:,"CWDistance"]

In [None]:
# Select all rows for multiple columns, ["CWDistance", "Height", "Wingspan"]
df.loc[:,["CWDistance", "Height", "Wingspan"]]

In [None]:
# Select rows labelled 0 through 9 (.loc slices include the end label) for multiple columns, ["CWDistance", "Height", "Wingspan"]
df.loc[:9, ["CWDistance", "Height", "Wingspan"]]

In [None]:
# Select range of rows for all columns
df.loc[10:15]

The .loc() function takes two arguments: the index labels of the rows and the names of the columns you wish to select.  If the column argument is omitted, all columns are returned.

In the above case **:** specifies all rows, and our column is **CWDistance**. df.loc[**:**,**"CWDistance"**]

Now, let's say we only want to return the first 10 observations:

In [None]:
df.loc[:9, "CWDistance"]

### .iloc()
.iloc() is integer-position-based slicing, whereas .loc() uses labels/column names. Unlike .loc(), an .iloc() slice excludes its end position. Here are some examples:

In [None]:
df.iloc[:4]

In [None]:
df.iloc[1:5, 2:4]

In [None]:
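# .iloc() accepts only integer positions, so passing column names to it raises an error; use .loc() for labels
df.loc[1:4, ["Gender", "GenderGroup"]]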
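
The difference between label-based and position-based selection is easiest to see on a small frame. This is a sketch with hypothetical data: the column names mirror the cartwheel data, but the values are made up.

```python
import pandas as pd

# Hypothetical data with the default integer index 0..5
df_small = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 6],
    'Gender': ['F', 'M', 'F', 'M', 'F', 'M'],
    'GenderGroup': [1, 2, 1, 2, 1, 2],
})

# .iloc: integer positions, end of slice excluded (positions 1 to 4)
by_position = df_small.iloc[1:5, 1:3]

# .loc: labels, end of slice included (labels 1 to 4)
by_label = df_small.loc[1:4, ['Gender', 'GenderGroup']]

# Column names can be translated to positions for use with .iloc
cols = df_small.columns.get_indexer(['Gender', 'GenderGroup'])
by_position_named = df_small.iloc[1:5, cols]

print(by_position.equals(by_label))
print(by_position.equals(by_position_named))
```

All three selections return the same four rows and two columns.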

We can view the data types of our data frame columns by calling .dtypes on our data frame:

In [None]:
df.dtypes

The output indicates we have integers, floats, and objects within our Data Frame.

We may also want to observe the different unique values within a specific column. Let's do this for Gender:

In [None]:
# List unique values in the df['Gender'] column
df.Gender.unique()

In [None]:
# Let's explore df["GenderGroup"] as well
df.GenderGroup.unique()

It seems that these fields may serve the same purpose, which is to specify male vs. female. Let's check this quickly by observing only these two columns:

In [None]:
# Use .loc() to specify a list of multiple column names
df.loc[:,["Gender", "GenderGroup"]]

From eyeballing the output, it seems to check out.  We can streamline this by utilizing the groupby() and size() functions.

In [None]:
df.groupby(['Gender','GenderGroup']).size()

This output indicates that only two combinations occur in the data:

* Case 1: Gender = F & GenderGroup = 1
* Case 2: Gender = M & GenderGroup = 2

This validates our initial assumption that these two fields essentially portray the same information.
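
The same check can be made programmatically rather than by reading the output. The sketch below uses hypothetical data with the same two column names and tests that each Gender value maps to exactly one GenderGroup value, and vice versa.

```python
import pandas as pd

# Hypothetical data
df_check = pd.DataFrame({
    'Gender': ['F', 'M', 'F', 'M', 'F'],
    'GenderGroup': [1, 2, 1, 2, 1],
})

# Count the rows for each observed (Gender, GenderGroup) pair
pair_counts = df_check.groupby(['Gender', 'GenderGroup']).size()
print(pair_counts)

# The columns carry the same information if the mapping is one-to-one
groups_per_gender = df_check.groupby('Gender')['GenderGroup'].nunique()
genders_per_group = df_check.groupby('GenderGroup')['Gender'].nunique()
one_to_one = bool((groups_per_gender == 1).all() and (genders_per_group == 1).all())
print(one_to_one)
```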