## Cover a few other optional (but important) parameters of the pd.Series call
* dtype - the default is to infer the data type (int64, float64, object, etc.) from the values in data
    * However, you can also explicitly declare the data type
    * This can be good if you want to, for example, re-cast the data to save space or to make types compatible
    * But this may also have important negative consequences if not done thoughtfully!
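A minimal sketch of both sides of that trade-off, using the same values as the data set below. The variable names (`s_default`, `s_small`, `f`) are made up for illustration: a smaller dtype saves space when every value fits, while a careless cast can silently lose information.

```python
import pandas as pd

values = [10, 23, 88, 43, 29]

# Default: pandas infers the dtype from the values
s_default = pd.Series(values)

# Explicit, smaller dtype: every value here fits in int8 (-128 to 127)
s_small = pd.Series(values, dtype='int8')

print(s_default.dtype, s_default.nbytes)
print(s_small.dtype, s_small.nbytes)    # int8 5

# A cast that loses information: float -> int drops the fractional part
f = pd.Series([1.7, 2.2, 3.9])
print(f.astype('int64').tolist())       # [1, 2, 3]
```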


#### Example: change from int to str
* Note that the dtype of Series 's' is now 'object'. This is how Pandas has traditionally stored Python 'str' values (recent Pandas versions may report a dedicated string dtype instead)


## First let's set up pandas and make a fake data set to play with





In [0]:
import pandas as pd
import random

## Make up some data and corresponding labels to play with

In [0]:
data = [10, 23, 88, 43, 29]
labels = [0,1,2,3,4]

In [0]:
# make a series with the data array from above, but this time make it a str
# instead of the inferred int64 type
s = pd.Series(data, index=labels, dtype='str')

# we're now all set to do a bunch of str operations without having to
# recast each time we interact with the values in s
print(s[0]=='10')   # True: the stored value is the string '10'
print(s[0]==10)     # False: a str never equals an int

### Re-make our series as int64 before moving on because we'll want them to be ints for the next several cells. 




In [0]:
s = pd.Series(data, index=labels, dtype='int64')

## Slicing a Pandas series
* Uses the same start:stop:step notation as Python lists

In [0]:
# first 3 values - notice that you get the label along with the values
print(s[:3])

In [0]:
s[2:-1]    # 3rd entry up to, but not including, the last entry

In [0]:
# reverse, etc
s[::-1]

In [0]:
# every other, etc
s[::2]
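Because the labels in this data set are also integers, it can be unclear whether a slice refers to positions or to labels. As a small sketch (the name `s_demo` is made up for illustration), `.iloc` always slices by position and excludes the stop, while `.loc` slices by label and includes the stop label:

```python
import pandas as pd

s_demo = pd.Series([10, 23, 88, 43, 29], index=[0, 1, 2, 3, 4])

by_position = s_demo.iloc[:3]   # positions 0, 1, 2 (stop is excluded)
by_label = s_demo.loc[:3]       # labels 0 through 3 (stop label is included)

print(by_position.tolist())     # [10, 23, 88]
print(by_label.tolist())        # [10, 23, 88, 43]
```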

### Another example: selecting values with a boolean condition (boolean indexing)
* this is super handy when cleaning data to exclude outliers!


In [0]:
s[s>=20]    # all entries greater than or equal to 20

### Find values within a range

In [0]:
s[(s>=20) & (s<=45)]

### There is also the 'between' method to find values within a range 
* the 'between' method returns True/False for each entry depending on whether it falls between the bounds
* you can then use that boolean mask to select the values within the range!

In [0]:
# exclude both endpoints (Pandas 1.3+; older versions used inclusive=False)
ind = s.between(23, 45, inclusive='neither')
s[ind]
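For comparison, here is a short sketch of how the `inclusive` argument changes which endpoints count. It assumes Pandas 1.3 or later, where `inclusive` accepts the strings `'both'`, `'neither'`, `'left'`, and `'right'`; the name `s_demo` is made up for illustration.

```python
import pandas as pd

s_demo = pd.Series([10, 23, 88, 43, 29])

both = s_demo[s_demo.between(23, 43, inclusive='both')]        # 23 <= x <= 43
neither = s_demo[s_demo.between(23, 43, inclusive='neither')]  # 23 <  x <  43
left = s_demo[s_demo.between(23, 43, inclusive='left')]        # 23 <= x <  43
right = s_demo[s_demo.between(23, 43, inclusive='right')]      # 23 <  x <= 43

print(both.tolist())      # [23, 43, 29]
print(neither.tolist())   # [29]
print(left.tolist())      # [23, 29]
print(right.tolist())     # [43, 29]
```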

## Series objects have many built-in operations
[list of attributes and methods](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.Series.html)

In [0]:
# shouldn't need to re-run, but make sure that you've got int64 data here (and 
# not str)
s = pd.Series(data, index=labels, dtype='int64')

In [0]:
# attributes
print('Data Type: ', s.dtype)

In [0]:
# basic methods
print('Mean: ', s.mean(), ' Std:', s.std(), 'Max: ', s.max())

In [0]:
# difference between consecutive entries (a simple numerical derivative)
print('Diff: ', s.diff())
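To make the `diff` output concrete, here is the same calculation on the values used above (`s_demo` and `d` are made-up names). Each entry is the current value minus the previous one, so the first entry has nothing to subtract from and comes back as NaN, which also turns the result into floats:

```python
import pandas as pd

s_demo = pd.Series([10, 23, 88, 43, 29])

d = s_demo.diff()            # d[i] = s_demo[i] - s_demo[i-1]
print(d.tolist())            # [nan, 13.0, 65.0, -45.0, -14.0]
print(d.dtype)               # float64, because NaN is a float
```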

## Find the mean of all values that fall within a range...
* can also apply other methods to compute std, etc after filtering

In [0]:
s[(s>=10) & (s<=45)].mean()

# Pandas DataFrames 
[The official project homepage](https://pandas.pydata.org)

* Goal
    * Extend what we learned about Series objects in the previous tutorial to their 2D counterpart - DataFrames
    * Develop some tools for dealing with missing data (not exhaustive, but a start)

## DataFrames

[Pandas quick start guide for DataFrames](https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe)

* A DataFrame (DF) is a labeled data structure that can be thought of as a 2D extension of the Series objects that we discussed in the first part of the tutorial
* A DF can accept many types of input: multiple Series, a dict of 1D arrays, another DF, etc.
* Like a Series, DFs contain data values and their labels. Because we're now dealing with a 2D structure, the **row labels are set with the index argument** and the **column labels with the columns argument**. 
    * Like a Series, if you don't explicitly assign row and column labels, then they will be auto-generated (but not as useful as specifying the labels yourself!)
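A minimal sketch of these points, using made-up values and made-up labels (`temps`, `auto`, and the row labels `'a'`, `'b'`, `'c'` are all hypothetical). The first DF gets explicit row labels and takes its column labels from the dict keys; the second shows the auto-generated labels you get when none are supplied.

```python
import pandas as pd

# Hypothetical values, for illustration only
temps = pd.DataFrame(
    {'Year': [2001, 2002, 2003], 'Mean': [0.5, 0.6, 0.7]},
    index=['a', 'b', 'c'],          # explicit row labels
)
print(temps.index.tolist())         # ['a', 'b', 'c']
print(temps.columns.tolist())       # ['Year', 'Mean']

# No labels given: rows and columns are both numbered from 0
auto = pd.DataFrame([[1, 2], [3, 4]])
print(auto.index.tolist())          # [0, 1]
print(auto.columns.tolist())        # [0, 1]

# Same data with both sets of labels specified
labeled = pd.DataFrame([[1, 2], [3, 4]], index=['r1', 'r2'], columns=['x', 'y'])
print(labeled.loc['r2', 'y'])       # 4
```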

<div class="alert alert-info">
Much of what we learned about Series objects will generalize to DFs, so here we'll focus on some of the key functionality that might not be obvious based on the first part of the tutorial.
</div>

<div class="alert alert-info">
One more quick note: if you are using an older version of Python (earlier than 3.6) and Pandas (earlier than 0.23) and you create a DF from a dict without explicitly specifying column names, then the columns will be entered into the DF in lexical order.
</div>

## Import libs

In [0]:
# import a generic pandas object and also a few specific functions that we'll use
import pandas as pd 
from google.colab import files

## Upload a file to the /content folder on google colab
* Select the file you want to upload (the csv file that I sent out)
* It will load into your '/content' folder
* Then you can interact with it just like a normal file on your hard drive



In [0]:
%ls


In [0]:
files.upload()

### Remove unwanted files...

In [0]:
%ls

In [0]:
%rm *.csv

## Make a DataFrame object to hold the contents of the data set
[DataFrame help page](https://pandas.pydata.org/pandas-docs/version/0.21/generated/pandas.DataFrame.html)

* Just like with the pd.Series call, you can specify the data, index labels (row labels in this case)
* In addition to row labels, you can also specify column labels (with 'columns')
* Can also specify data type (default is inferred)
* If you read in the data from a csv file, you will be able to inherit row and column labels (if they are specified in the file). 
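As a rough sketch of how labels are inherited from a csv file, the example below reads a tiny made-up csv from a string instead of from disk. The column names match the ones used later in this tutorial, but the values and the names `csv_text`, `demo`, and `indexed` are hypothetical. The header row becomes the column labels, and `index_col` promotes a column to row labels.

```python
import io
import pandas as pd

# Hypothetical csv contents; note the empty 'Mean' field for 2002
csv_text = "Year,Mean\n2001,0.5\n2002,\n2003,0.7\n"

# Column labels come from the header row; row labels are auto-generated
demo = pd.read_csv(io.StringIO(csv_text))
print(demo.columns.tolist())        # ['Year', 'Mean']
print(demo.index.tolist())          # [0, 1, 2]

# Use the 'Year' column as the row labels instead
indexed = pd.read_csv(io.StringIO(csv_text), index_col='Year')
print(indexed.index.tolist())       # [2001, 2002, 2003]

# The empty field is read in as a missing value (NaN)
print(demo['Mean'].isna().sum())    # 1
```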

In [0]:
# call pd.read_csv to create the DF directly from the csv file
df = pd.read_csv('annual_temp_csv2.csv')

In [0]:
# take a look at the output...
# compare to print(df) - looks nicer with display thanks to the IPython backend 
display(df)   

In [0]:
# another handy display function...good for large dfs that are too big to fit - 
# at least you can get an idea of the overall structure
df.head()

## Get a high-level summary of the data using built-in functionality of DataFrame object
[API reference page](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html)

* What do you notice about the two counts for Year and for Mean?

In [0]:
df.describe()

## Just like with a Series object, you can compute mean, std, etc

In [0]:
df['Year'].mean()

### Remember that you can also access a column as an attribute (df.Mean)...I prefer by name like ['Mean'] to avoid confusion with built-in methods/functions, but either will work

In [0]:
df.Mean.std()
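Here is a small sketch of the kind of confusion that bracket notation avoids. The DF and its `'count'` column are made up for illustration: when a column shares its name with a DataFrame method, attribute access returns the method, not the column.

```python
import pandas as pd

# Hypothetical DF with a column named after a built-in method
demo = pd.DataFrame({'Mean': [0.5, 0.7], 'count': [3, 4]})

print(demo.Mean.tolist())        # [0.5, 0.7]  attribute access works here
print(callable(demo.count))      # True: this is the DataFrame.count method
print(demo['count'].tolist())    # [3, 4]      brackets always give the column
```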

### By default, mean, std etc will skip (ignore) missing values (NaNs)
* Sometimes, it's good to do a sanity check if you think there are missing values. 
* Can do this by choosing to NOT skip the NaNs...in which case if they exist you'll get back NaN as the answer!
* Then you know that there are NaNs in the data set. 

In [0]:
df['Mean'].mean(skipna=False)
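The same check can be sketched on a small made-up DF (the values and the name `demo` are hypothetical), along with two other ways to spot missing values: counting them directly with `isna().sum()`, and comparing the per-column `count()`, which only counts non-missing entries.

```python
import pandas as pd

# Hypothetical data with one missing 'Mean' value
demo = pd.DataFrame({'Year': [2001, 2002, 2003],
                     'Mean': [0.5, float('nan'), 1.5]})

print(demo['Mean'].mean())               # 1.0  NaN is skipped by default
print(demo['Mean'].mean(skipna=False))   # nan  NaN propagates to the result

print(demo['Mean'].isna().sum())         # 1    number of missing values
print(demo.count().tolist())             # [3, 2]  non-missing entries per column
```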