# Imports

In [29]:
# Import pandas as pd; this alias is the standard convention.
import pandas as pd

# I have a custom script in this dir which stores ('pickles') python
# objects into files. 
import pickle_funcs as pk 

# Pickle Functions


The pickle library lets Python objects be saved to files, which can be handy when working with spreadsheets. I've written a small script that wraps the library and makes this easier.

Use the command

```Python
pk.pickle_object(any_python_object, 'filename', test=True)
```

to pickle `any_python_object` as 'filename.pickle' (don't include the extension). By default the function checks that the object was pickled correctly by reloading it and comparing the reloaded version to the original. The test sometimes fails, e.g. when comparing two Pandas dataframes, so it can be disabled with `test=False`.

To load (unpickle) an object: 

```Python
loaded_object = pk.unpickle_object('filename')
```

Note that it isn't necessary to type the extension after `filename`. This simply loads the saved object into the script.
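
The `pickle_funcs` script itself isn't shown in these notes, so the following is only a guess at what such helpers could look like, built on the standard `pickle` module. The function names and the `test` flag mirror the calls above; everything inside the functions is an assumption.

```python
import pickle


def pickle_object(obj, filename, test=True):
    """Save obj to '<filename>.pickle'; optionally reload it and compare."""
    path = filename + '.pickle'
    with open(path, 'wb') as f:
        pickle.dump(obj, f)
    if test:
        # Assumed check: reload and compare with ==. This only works when ==
        # returns a plain bool; for two DataFrames it returns a DataFrame,
        # and taking its truth value raises a ValueError.
        with open(path, 'rb') as f:
            reloaded = pickle.load(f)
        if not reloaded == obj:
            raise ValueError('reloaded object differs from the original')


def unpickle_object(filename):
    """Load and return the object stored in '<filename>.pickle'."""
    with open(filename + '.pickle', 'rb') as f:
        return pickle.load(f)
```

If the real script compares objects in a similar way, that would explain why the test has to be switched off for dataframes.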

# Loading a CSV File

I have a sample CSV file, an export of NOAA weather data. It has been cut down in length to serve as a better example.

To load a CSV into a [Pandas Dataframe](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) (the central Pandas object type, somewhat like a spreadsheet), use the [pd.read_csv()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html) function as below:

In [40]:
# Note that calling the main dataframe 'df' is a de facto standard.
df = pd.read_csv('sample_data.csv')

In [6]:
df.describe()

Unnamed: 0,ELEVATION,LATITUDE,LONGITUDE,DATE,PRCP,Time of Observation,SNWD,Time of Observation.1,SNOW,Time of Observation.2
count,1064.0,1064.0,1064.0,1064.0,1064.0,1064.0,1064.0,1064.0,1064.0,1064.0
mean,766.0,37.737,-100.0294,20075720.0,0.064793,9999.0,-9218.890508,9999.0,-253.687124,9999.0
std,0.0,4.336349e-13,4.976138e-13,9708.247,0.263675,0.0,2683.213417,0.0,1573.230999,0.0
min,766.0,37.737,-100.0294,20060600.0,0.0,9999.0,-9999.0,9999.0,-9999.0,9999.0
25%,766.0,37.737,-100.0294,20070320.0,0.0,9999.0,-9999.0,9999.0,0.0,9999.0
50%,766.0,37.737,-100.0294,20080110.0,0.0,9999.0,-9999.0,9999.0,0.0,9999.0
75%,766.0,37.737,-100.0294,20081010.0,0.0,9999.0,-9999.0,9999.0,0.0,9999.0
max,766.0,37.737,-100.0294,20090800.0,3.28,9999.0,12.0,9999.0,10.0,9999.0


I've loaded the dataframe and used the [describe()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html?highlight=dataframe%20describe#pandas.DataFrame.describe) method to show the numeric columns and some basic statistics about each one.

# Basic dataframe attributes 

Need to memorize the #bk pandas cheatsheet pdf. 
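
Until then, a few attributes I'm sure exist are sketched here on a small stand-in frame. The column names mirror the NOAA sample, but the values are made up for illustration.

```python
import pandas as pd

# Stand-in data, not taken from sample_data.csv.
demo = pd.DataFrame({'DATE': [20060601, 20060602, 20060603],
                     'PRCP': [0.0, 0.12, 3.28]})

shape = demo.shape                 # (number of rows, number of columns)
column_names = list(demo.columns)  # column labels in order
dtypes = demo.dtypes               # one dtype per column
first_rows = demo.head(2)          # first two rows as a new DataFrame
```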

# Writing a CSV File

Call the [to_csv](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html) method on a df to write it to CSV.

In [43]:
# Passing a path makes to_csv() write the file directly.
df.to_csv('output.csv')
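
By default the row index is written as an extra first column, which is usually where an `Unnamed: 0` column comes from when the file is read back. A small sketch of avoiding that with `index=False`, using a made-up stand-in frame and an in-memory buffer instead of a real file:

```python
import io

import pandas as pd

# Stand-in data, not taken from sample_data.csv.
demo = pd.DataFrame({'DATE': [20060601, 20060602], 'PRCP': [0.0, 0.12]})

# With no path, to_csv() returns the CSV text; index=False omits the row index.
csv_text = demo.to_csv(index=False)

# Reading the text back should give a frame equal to the original.
round_trip = pd.read_csv(io.StringIO(csv_text))
```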

# Misc Functions 

## Sort a DataFrame

The [sort_values()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html) method can be called on a dataframe. It sorts by the values of the column(s) given in `by`, in ascending order by default. Set `inplace=True` to modify the df; otherwise it returns a new dataframe.

In [11]:
df.sort_values(by=['PRCP'], inplace=True)
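
For comparison, a sketch of the non-inplace form on a made-up stand-in frame. `ascending=False` reverses the order, and `ignore_index=True` (assuming pandas 1.0 or later) renumbers the rows after sorting.

```python
import pandas as pd

# Stand-in data, not taken from sample_data.csv.
demo = pd.DataFrame({'DATE': [20060601, 20060602, 20060603],
                     'PRCP': [0.12, 3.28, 0.0]})

# Without inplace=True, demo is left untouched and a sorted copy is returned.
wettest_first = demo.sort_values(by='PRCP', ascending=False)

# ignore_index=True gives the sorted copy a fresh 0..n-1 index.
renumbered = demo.sort_values(by='PRCP', ignore_index=True)
```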

## View Length of a DataFrame


Simply use the len() function. It returns the number of rows in the df, excluding the header row.

In [12]:
len(df)

1064

## Replace certain values in DF 

Use the [replace()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.replace.html) method. Every value found in `replace_list` is replaced with the `value` argument, which here is the string `'NaN'`.

In [42]:
replace_list = [
    '-9999.00', -9999.00, '-9999.0', -9999.0, '-9999',
    -9999, '9999', 9999
]

df = df.replace(to_replace=replace_list, value='NaN');
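
An alternative worth trying is to use a real missing value (`np.nan`) instead of the string `'NaN'`, so that methods like `isna()` and `dropna()` recognise it. The sketch below uses a made-up two-row CSV and shows two options: replacing after loading, or declaring the sentinel with the `na_values` parameter of `read_csv`.

```python
import io

import numpy as np
import pandas as pd

# Made-up CSV text standing in for the NOAA export.
csv_text = "DATE,PRCP,SNWD\n20060601,0.0,-9999\n20060602,0.12,12\n"

# Option 1: load first, then swap the sentinel for a real NaN.
replaced = pd.read_csv(io.StringIO(csv_text))
replaced = replaced.replace(-9999, np.nan)

# Option 2: tell read_csv which raw strings mean "missing".
declared = pd.read_csv(io.StringIO(csv_text), na_values=['-9999'])
```

Either way the column stays numeric, which the string `'NaN'` does not guarantee.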

## Drop columns in df

Use the [drop()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html) method to drop columns by name. `axis=1` is required to drop columns rather than rows.

In [14]:
drop_values = [
    'ELEVATION',
    'STATION',
    'LATITUDE',
    'LONGITUDE'
]
df.drop(drop_values, inplace=True, axis=1)
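
`drop()` also accepts a `columns=` keyword, which avoids having to remember `axis=1`, and `errors='ignore'` skips names that aren't present. A sketch on a made-up stand-in frame:

```python
import pandas as pd

# Stand-in data, not taken from sample_data.csv.
demo = pd.DataFrame({'STATION': ['A', 'A'],
                     'ELEVATION': [766.0, 766.0],
                     'PRCP': [0.0, 0.12]})

# 'LATITUDE' is not in demo; errors='ignore' skips it instead of raising.
trimmed = demo.drop(columns=['STATION', 'ELEVATION', 'LATITUDE'],
                    errors='ignore')
```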

## Select certain columns in df

Simply select the columns you want and assign the result back to df.

In [16]:
columns = [
    'DATE',
    'PRCP',
    'SNWD'
]
df = df[columns]

## Drop certain rows in a df based on value

One way of doing it is to filter on the values you want to keep.

In [31]:
len(df)

1064

In [44]:
# Drop rows in df which have zero or 'NaN' values in SNWD
df = df[df.SNWD != 0]
df = df[df.SNWD != 'NaN']

In [45]:
len(df)

35

There may very well be a better way of doing this. Need to research.
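
One candidate, sketched on a made-up stand-in frame: if the missing values are stored as real NaN (`np.nan`) rather than the string `'NaN'`, a single boolean mask or `dropna(subset=...)` does the job.

```python
import numpy as np
import pandas as pd

# Stand-in data, not taken from sample_data.csv.
demo = pd.DataFrame({'DATE': [20060601, 20060602, 20060603, 20060604],
                     'SNWD': [0.0, np.nan, 12.0, 3.0]})

# Keep rows where SNWD is present and non-zero, using one combined mask.
kept = demo[demo['SNWD'].notna() & (demo['SNWD'] != 0)]

# Equivalent two-step form: drop missing rows, then filter out the zeros.
kept_two_step = demo.dropna(subset=['SNWD'])
kept_two_step = kept_two_step[kept_two_step['SNWD'] != 0]
```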

## Force types on column

In [49]:
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': ['04', '024', 31, 2, 3]}
df2 = pd.DataFrame(data, index=['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df2


Unnamed: 0,name,reports,year
Cochice,Jason,4,2012
Pima,Molly,24,2012
Santa Cruz,Tina,31,2013
Maricopa,Jake,2,2014
Yuma,Amy,3,2014


In [50]:
df2.describe()

Unnamed: 0,year
count,5.0
mean,2013.0
std,1.0
min,2012.0
25%,2012.0
50%,2013.0
75%,2014.0
max,2014.0


Here, the reports column isn't treated as a column of numbers because it mixes strings (such as `'04'` and `'024'`) with integers. Use the [astype()](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.astype.html) method to force the `reports` column to integers and to change the `year` column to a column of strings.

In [51]:
df2['reports'] = df2['reports'].astype(int)
df2['year'] = df2['year'].astype(str)

In [53]:
df2.describe()

Unnamed: 0,reports
count,5.0
mean,12.8
std,13.663821
min,2.0
25%,3.0
50%,4.0
75%,24.0
max,31.0
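
`astype(int)` raises an error if any value can't be parsed as an integer. A more forgiving option is `pd.to_numeric` with `errors='coerce'`, sketched below; the `'n/a'` entry is a hypothetical bad value added for illustration and is not in the data above.

```python
import pandas as pd

# Mixed strings and integers, plus one hypothetical unparseable value.
reports = pd.Series(['04', '024', 31, 2, 'n/a'])

# errors='coerce' turns values that cannot be parsed into NaN instead of raising.
numeric = pd.to_numeric(reports, errors='coerce')
```

Because of the NaN, the result is a float column rather than an integer one.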


# Notes


* first going through current repos and adding whatever appears useful.