# Python Libraries

Commonly used Python libraries:

* **NumPy (np)** a library for working with arrays of data

* **Pandas (pd)** provides high-performance, easy-to-use data structures and data analysis tools

* **SciPy** a library of techniques for numerical and scientific computing (only import specific functions)

* **Matplotlib.pyplot (plt)** plotting library for making graphs (only import the **pyplot** module)

* **Seaborn (sns)** a higher-level interface to Matplotlib that can be used to simplify many graphing tasks

* **Statsmodels.api (sm)** provides classes and functions for the estimation of many different statistical models

* **scikit-learn** a machine learning library with many of the same statistical techniques as Statsmodels (only import specific functions)
  
  
NOTE: a library can be imported in its entirety (usually with an alias), or only a portion of it can be imported.  
* Importing a *library*: **import** numpy as np
* Importing a *module*: **import** matplotlib.pyplot as plt
* Importing a *function*: **from** numpy **import** arange  
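
The three import forms listed above can be sketched in a single cell. This is a minimal sketch that assumes NumPy is installed; `numpy.linalg` stands in for the module case only so that the example does not require Matplotlib.

```python
import numpy as np            # whole library, with the conventional alias
import numpy.linalg as la     # a single module of the library
from numpy import arange      # a single function

# All three names can now be used directly.
values = arange(4)                  # same function as np.arange
identity_det = la.det(np.eye(2))    # determinant of the 2x2 identity matrix

print(values)
print(identity_det)
```

Importing only a function keeps the namespace small, while the alias form makes it obvious where each call comes from.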


### Commonly used sample datasets

Seaborn sample datasets  
https://github.com/mwaskom/seaborn-data  
(used as a target for the seaborn.load_dataset function)

scikit-learn sample datasets (need to be converted into dataframes)  
https://scikit-learn.org/stable/datasets/index.html#toy-datasets  
(note: scikit-learn dataset objects have separate "data", "feature_names", and "target")
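
As a rough sketch of that conversion, the snippet below builds a dataframe from the iris toy dataset. It assumes scikit-learn and Pandas are installed; the choice of `load_iris` and the name `iris_df` are illustrative and not taken from this notebook.

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()  # object with .data, .feature_names, and .target

# Features become the columns; the target is appended as one more column.
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_df["target"] = iris.target

print(iris_df.head())
```

Recent scikit-learn versions also accept `as_frame=True` in `load_iris`, which returns the data as a dataframe directly.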



### Commonly used functions  

* **pd.read_csv** Pandas function to read `.csv` files
* **df.describe()** produces descriptive stats on the numeric variables in a dataframe named "df"
* **df.head()** shows the top of the dataframe




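The following code loads the Boston housing data from scikit-learn and creates a Pandas dataframe. Note that scikit-learn's `load_boston` loader was deprecated in version 1.0 and removed in version 1.2, so this step requires an older scikit-learn release.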

Pandas has a variety of functions named '`read_xxx`' for reading data in different formats.  Right now we will focus on reading '`csv`' files, where "csv" stands for comma-separated values. Other supported formats include Excel, JSON, and SQL, to name a few.

There are many other options to '`read_csv`' that are very useful.  For example, you would use the option `sep='\t'` instead of the default `sep=','` if the fields of your data file are delimited by tabs instead of commas.  See [here](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) for the full documentation for '`read_csv`'.
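
A minimal sketch of the `sep` option follows. The tab-delimited text is made up for illustration and is read from an in-memory buffer (`io.StringIO`) rather than a file, so nothing needs to exist on disk.

```python
import io
import pandas as pd

# Hypothetical tab-delimited data; a file path could be passed instead.
raw_text = "name\tage\theight\nAnn\t30\t1.62\nBob\t25\t1.80\n"

people = pd.read_csv(io.StringIO(raw_text), sep="\t")

print(people.head())      # first rows of the dataframe
print(people.describe())  # descriptive stats for the numeric columns
```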


NOTE: From this point forward, this notebook needs to be re-written

### String Functions



The next cell shows how to call functions within an imported library:

In [None]:
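import numpy as np  # needed for the np alias used below

a = np.array([[1, 2], [3, 4]])
b = np.array([[11, 12], [13, 14]])

np.dot(a, b)  # for 2-D arrays this is the matrix product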

As you can see, we used the dot() function within the numpy library to calculate the dot product of two arrays, a and b. Because both arrays are 2-dimensional, the result is their matrix product.
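
To see what `np.dot` computes here, the sketch below compares it with the `@` operator and with element-wise multiplication. It assumes NumPy is installed and reuses the arrays `a` and `b` from the cell above; the other variable names are illustrative.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[11, 12], [13, 14]])

matrix_product = np.dot(a, b)   # rows of a combined with columns of b
same_product = a @ b            # the @ operator gives the same result for 2-D arrays
elementwise = a * b             # multiplies matching entries, a different operation

print(matrix_product)   # [[37 40]
                        #  [85 92]]
print(elementwise)      # [[11 24]
                        #  [39 56]]
```

For example, the top-left entry of the matrix product is 1*11 + 2*13 = 37, whereas the top-left entry of the element-wise product is simply 1*11 = 11.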

# Date Functions

Python's `time` module represents dates and times as the number of seconds elapsed since an epoch (the point where time starts). To find out what the epoch is on your platform, look at `time.gmtime(0)`.

Python Documentation: datetime — Basic date and time types  
https://docs.python.org/3/library/datetime.html  
(see links for specific documentation on “calendar” and “time” functions)  

Python Wiki: Working with Time  
https://wiki.python.org/moin/WorkingWithTime  

Leap year problem  
https://en.wikipedia.org/wiki/Leap_year_problem  
(Some software programs have had problems with how they account for leap years.)  



In [34]:
import time  # imports the time module
# NOTE: the epoch is usually Jan 1, 1970

var_time1 = time.gmtime(0)   # returns the epoch as a UTC struct_time
var_time2 = time.gmtime()    # returns the current UTC date and time

print(var_time1, "\n", var_time2)

time.struct_time(tm_year=1970, tm_mon=1, tm_mday=1, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=3, tm_yday=1, tm_isdst=0) 
 time.struct_time(tm_year=2020, tm_mon=5, tm_mday=4, tm_hour=18, tm_min=55, tm_sec=37, tm_wday=0, tm_yday=125, tm_isdst=0)
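
The same epoch can be seen from the `datetime` side. The sketch below uses only the standard library to convert a count of seconds since the epoch into a timezone-aware date and back; the variable names are illustrative.

```python
import datetime
import time

seconds_now = time.time()  # float: seconds elapsed since the epoch
print(seconds_now)

# Zero seconds corresponds to the epoch itself (expressed here in UTC).
epoch = datetime.datetime.fromtimestamp(0, tz=datetime.timezone.utc)
print(epoch)  # 1970-01-01 00:00:00+00:00

# Going the other way: a datetime back to seconds since the epoch.
one_day_later = epoch + datetime.timedelta(days=1)
print(one_day_later.timestamp())  # 86400.0
```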


### Creating date variables

In [35]:
import datetime  # imports the datetime module

x = datetime.datetime.now()

print(x)

2020-05-04 13:55:49.272758


In [36]:
import datetime  # imports the datetime module
my_bd = datetime.datetime(1962, 8, 12)
#print(my_bd)
#print(my_bd.month)          # returns the month number
print(my_bd.strftime("%B"))  # returns the full month name
print(my_bd.strftime("%A"))  # returns the full weekday name

August
Sunday
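
Dates often arrive as text, so a natural next step is parsing a string and doing arithmetic on the result. The following is a small standard-library sketch; the date strings and variable names are chosen for illustration (the second date matches the run date shown in the `gmtime()` output earlier).

```python
import datetime

# Parse text into datetime objects using an explicit format string.
start = datetime.datetime.strptime("2020-01-01", "%Y-%m-%d")
end = datetime.datetime.strptime("2020-05-04", "%Y-%m-%d")

# Subtracting two datetimes gives a timedelta.
elapsed = end - start
print(elapsed.days)  # 124

# weekday() numbers the days from Monday=0 to Sunday=6.
print(end.weekday())  # 0, i.e. Monday
```

Note that the names produced by `%B` and `%A` in `strftime` follow the current locale, so the output may not be in English on every system.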


### Viewing Data

In [None]:
# We can view our Data Frame by calling the head() function
df.head()

The head() method shows the first 5 rows of our Data Frame by default.  If we wanted to show the entire Data Frame, we would simply write the following:

In [None]:
# Output entire Data Frame
df

As you can see, we have a 2-Dimensional object where each row is an independent observation of our cartwheel data.

To gather more information regarding the data, we can view the column names and the data type of each column with the following attributes:

In [None]:
df.columns

We can view the data types of our data frame columns by calling .dtypes on our data frame:

In [None]:
df.dtypes

NOTE: The following is from a UofM notebook

The output indicates we have integers, floats, and objects within our Data Frame.
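
For readers without the cartwheel file at hand, the sketch below reproduces the same inspection steps on a tiny made-up dataframe. The column names `Age` and `Height` and all values are hypothetical; only `Gender` appears in this notebook.

```python
import pandas as pd

# Hypothetical stand-in for the cartwheel data.
sample_df = pd.DataFrame({
    "Age": [23, 31, 27],            # whole numbers -> integer dtype
    "Height": [61.5, 70.0, 66.25],  # decimals -> float dtype
    "Gender": ["F", "M", "F"],      # text -> object (or string) dtype
})

print(sample_df.columns)  # column labels
print(sample_df.dtypes)   # one dtype per column
```

How text columns are reported depends on the Pandas version, so the last dtype may be shown as `object` or as a string type.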

We may also want to observe the unique values within a specific column. Let's do this for Gender:

In [None]:
# List unique values in the df['Gender'] column
df.Gender.unique()

In [None]:
# Let's explore df["GenderGroup"] as well
df.GenderGroup.unique()

It seems that these fields may serve the same purpose, which is to specify male vs. female. Let's check this quickly by observing only these two columns:

In [None]:
# Use .loc[] to specify a list of multiple column names
df.loc[:,["Gender", "GenderGroup"]]

From eyeballing the output, it seems to check out.  We can streamline this by utilizing the groupby() and size() functions.

In [None]:
df.groupby(['Gender','GenderGroup']).size()

This output indicates that we have two types of combinations.

* Case 1: Gender = F & GenderGroup = 1
* Case 2: Gender = M & GenderGroup = 2

This validates our initial assumption that these two fields essentially portray the same information.
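
The same consistency check can be tried on a small made-up dataframe. The values below are hypothetical and simply mirror the F/1 and M/2 pairing described above; `pairs` and `groups_per_gender` are illustrative names.

```python
import pandas as pd

# Hypothetical rows mirroring the Gender / GenderGroup pairing.
sample_df = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F"],
    "GenderGroup": [1, 2, 1, 2, 1],
})

print(list(sample_df["Gender"].unique()))  # ['F', 'M']

# One row per observed (Gender, GenderGroup) combination, with its count.
pairs = sample_df.groupby(["Gender", "GenderGroup"]).size()
print(pairs)

# If the two columns carry the same information, each Gender maps to one group.
groups_per_gender = sample_df.groupby("Gender")["GenderGroup"].nunique()
print((groups_per_gender == 1).all())  # True
```

Counting distinct groups per gender turns the visual check into a single boolean, which is easier to rely on when the dataframe is large.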