# Cleaning and Transforming Data

## Bringing in Data

* `pd.read_csv('./DOHMH_Dog_Bite_Data.csv')`
    * lots of options
    * https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
* `read_excel`
* `read_json`
* etc. (see book)
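
A minimal sketch of a few of those `read_csv` options. The CSV text and its column names (`UniqueID`, `DateOfBite`, `Breed`, `Age`) are made-up stand-ins, not the real file; `io.StringIO` is used only so the example runs without a file on disk.

```python
import io
import pandas as pd

# Made-up stand-in for a CSV file; the column names are assumptions.
csv_text = """UniqueID,DateOfBite,Breed,Age
1,January 02 2015,Poodle,4
2,January 03 2015,UNKNOWN,
3,February 10 2015,Beagle,2Y
"""

# read_csv accepts a path or any file-like object.
df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=['DateOfBite', 'Breed', 'Age'],  # keep only some columns
    dtype={'Breed': 'string'},               # force a column's type
    na_values=['UNKNOWN'],                   # extra strings to treat as missing
)
print(df.dtypes)
```

With a real file, the first argument would simply be the path, as in the bullet above.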

## Initial Exploration

* show rows: `head(n)`, `tail(n)`, `sample(n)`
* meta about columns: `dtypes` (`dtype` on a single Series), `count()`, `info()`
* summary stats: `describe()`
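
A quick sketch of these calls on a tiny made-up frame (the `breed` and `age` columns are hypothetical, chosen only for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'breed': ['Poodle', 'Beagle', 'Pug', 'Husky', 'Boxer'],
    'age': [4.0, 2.0, np.nan, 7.0, 1.0],
})

print(df.head(2))                    # first 2 rows
print(df.tail(2))                    # last 2 rows
print(df.sample(2, random_state=0))  # 2 random rows (seeded for repeatability)

print(df.dtypes)      # dtype of every column (a single Series has .dtype)
print(df.count())     # non-null values per column
df.info()             # dtypes, non-null counts, memory usage (prints directly)
print(df.describe())  # summary stats for the numeric columns

age_count = df['age'].count()  # NaN is not counted
age_mean = df['age'].mean()    # NaN is skipped
```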

## Working on Columns

* what data do we have? `dtype`
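* count values by Python type... but note, `NaN` counts as a `float`!
    * `map(lambda x: type(x))`, then `value_counts()`
* what are some actual values? `value_counts()`
* want to temporarily drop rows with a null in a column?
    * `tmp = df.dropna(subset=['col'])` (plain `df.dropna()` drops rows with a null in *any* column)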
* sort a Series by index:
    * `sort_index()`
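
The steps above can be sketched as follows. The data is made up, and the `age` / `breed` column names are placeholders rather than the real column names.

```python
import numpy as np
import pandas as pd

# Made-up column mixing strings, a number, and missing values.
ages = pd.Series(['4', '2Y', np.nan, 7, np.nan], dtype=object)

# Count values by Python type; NaN shows up as float.
type_counts = ages.map(type).value_counts()
print(type_counts)

# Most common actual values (NaN is excluded unless dropna=False).
print(ages.value_counts())

df = pd.DataFrame({'age': ages,
                   'breed': ['Pug', 'Pug', 'Husky', None, 'Boxer']})

# Drop rows with a null in one specific column; df itself is unchanged.
tmp = df.dropna(subset=['age'])

# value_counts sorts by frequency; sort_index reorders by label instead.
by_breed = df['breed'].value_counts().sort_index()
print(by_breed)
```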

## Altering Display Options

* `with pd.option_context('display.max_rows', 500):` (the setting applies only inside the `with` block)
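
`option_context` is a context manager, so the option is changed temporarily and put back afterwards. A small sketch, using a throwaway 200-row frame:

```python
import pandas as pd

df = pd.DataFrame({'n': range(200)})

# Outside the block, whatever limit is currently set applies.
default_rows = pd.get_option('display.max_rows')

with pd.option_context('display.max_rows', 500):
    # Inside the block, up to 500 rows are rendered without truncation.
    inside_rows = pd.get_option('display.max_rows')
    full_text = repr(df)

# On exit the previous setting is restored automatically.
restored_rows = pd.get_option('display.max_rows')
```

For a permanent change in a session, `pd.set_option('display.max_rows', 500)` does the same thing without the automatic reset.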

## Work on Specific Columns

btw, vectorized methods / accessors on Series:

* use `.dt` or `.str`
* call methods from there
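
A tiny sketch of both accessors on made-up values:

```python
import pandas as pd

breeds = pd.Series([' pit bull', 'Beagle ', None])
dates = pd.Series(pd.to_datetime(['2015-01-02', '2015-07-14']))

# .str applies string methods element-wise and skips missing values.
clean = breeds.str.strip().str.upper()

# .dt exposes datetime fields and methods for the whole Series.
years = dates.dt.year
weekdays = dates.dt.day_name()
```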

conversions/handling columns

* numeric, but stored as object or str
    * `astype('float64')`... but!!!! it raises if any value can't be parsed
    * `map` to apply arbitrary functions, like `replace`
        * note!!!! `na_action='ignore'` skips missing values
    * use a more sophisticated function (this is tricky)
    * `pd.to_numeric(series, errors='coerce')` turns unparseable values into `NaN`
* date, `dt`
    * convert to datetime object
        * `pd.to_datetime(series, errors='coerce')`
    * test it out on a string first
        * `pd.to_datetime('January 02 2015')`
    * convert all date objects to month name: `.dt.month_name()`
    * convert all date objects to month number `.dt.month`
    * to graph... w/ month names
        * `import calendar`
        * `list(calendar.month_abbr)[1:]` (index 0 is `''` so that index 1 is Jan)
        * pass the whole list to `xticks`
* `str`
    * `.strip()`
    * `.upper()`
    * `.split()`
        * `expand=True` to create a data frame
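
A sketch of the conversions above, on made-up values. The `'2Y'` entry and the owner names are invented to show what goes wrong and how each tool handles it.

```python
import calendar
import numpy as np
import pandas as pd

# --- numeric data stored as strings (made-up values) ---
raw_age = pd.Series(['4', '2Y', np.nan, '10'])

# raw_age.astype('float64') would raise ValueError because of '2Y'.
# Option 1: clean with an arbitrary function; na_action='ignore'
# skips missing values so the lambda never sees them.
stripped = raw_age.map(lambda s: s.replace('Y', ''), na_action='ignore')
age_clean = pd.to_numeric(stripped)

# Option 2: coerce anything unparseable to NaN.
age_coerced = pd.to_numeric(raw_age, errors='coerce')

# --- dates ---
print(pd.to_datetime('January 02 2015'))  # try a single string first
raw_date = pd.Series(['January 02 2015', 'March 15 2015', 'not a date'])
when = pd.to_datetime(raw_date, errors='coerce')  # bad values become NaT
month_names = when.dt.month_name()
month_nums = when.dt.month

# month_abbr[0] is an empty string so that index 1 is January;
# slice it off to get 12 tick labels.
tick_labels = list(calendar.month_abbr)[1:]

# --- strings ---
owners = pd.Series([' smith, ann ', 'lee, bo'])
# expand=True returns a DataFrame with one column per piece.
parts = owners.str.strip().str.upper().str.split(', ', expand=True)
```

Option 1 keeps the `'2Y'` row as `2.0`, while Option 2 throws it away as `NaN`; which is right depends on what the suffix actually means in the data.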
    
## Data Set Questions


* what's the average age of the dogs in the data set? oldest? youngest?
    * let's check what the metadata has to say about this
    * https://data.cityofnewyork.us/Health/DOHMH-Dog-Bite-Data/rsgh-akpg
    * ah... but what are the units??? IDK!!!! 🤔
* when is the worst time of year for dog bites? (so I can go out in a bubble)
* which breeds, besides the obvious, are the bite-iest? (er, let's do 2nd and/or 3rd place)
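
One possible way to approach these questions, sketched on a handful of made-up rows. The column names (`DateOfBite`, `Breed`, `Age`) are assumptions about the file, the values are invented, and the results say nothing about the real data set.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names are assumptions, values are not real data.
bites = pd.DataFrame({
    'DateOfBite': ['January 02 2015', 'July 04 2015', 'July 19 2015',
                   'July 21 2016', 'August 01 2016', 'March 03 2017'],
    'Breed': ['Pit Bull', 'Pit Bull', 'Pit Bull',
              'Beagle', 'Beagle', 'Poodle'],
    'Age': ['4', '2', '10M', np.nan, '7', '3'],
})

# Q1: ages. Entries that are not plain numbers become NaN, which
# sidesteps (but does not answer) the units question.
age = pd.to_numeric(bites['Age'], errors='coerce')
print(age.mean(), age.max(), age.min())

# Q2: bites per calendar month, ordered by month number.
month = pd.to_datetime(bites['DateOfBite'], errors='coerce').dt.month
per_month = month.value_counts().sort_index()
worst_month = per_month.idxmax()

# Q3: breeds ranked by number of bites; positions 1 and 2 are the runners-up.
ranked = bites['Breed'].value_counts()
runners_up = ranked.index[1:3].tolist()
```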

In [2]:
import pandas as pd
import numpy as np

In [10]:
df = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=list('abcd'))

In [23]:
df

Unnamed: 0,a,b,c,d
0,0,1,2.0,3.0
1,4,5,6.0,
2,8,9,,11.0


In [24]:
df.loc[1, 'd'] = np.nan
df.loc[2, 'c'] = np.nan

In [25]:
df

Unnamed: 0,a,b,c,d
0,0,1,2.0,3.0
1,4,5,6.0,
2,8,9,,11.0


In [26]:
df.dropna(axis=1)

Unnamed: 0,a,b
0,0,1
1,4,5
2,8,9


In [27]:
df

Unnamed: 0,a,b,c,d
0,0,1,2.0,3.0
1,4,5,6.0,
2,8,9,,11.0


In [28]:
df.fillna(0)

Unnamed: 0,a,b,c,d
0,0,1,2.0,3.0
1,4,5,6.0,0.0
2,8,9,0.0,11.0


In [29]:
df.fillna({'c': 100, 'd': 123})

Unnamed: 0,a,b,c,d
0,0,1,2.0,3.0
1,4,5,6.0,123.0
2,8,9,100.0,11.0


In [30]:
df.ffill()  # same result as fillna(method='ffill'), which is deprecated since pandas 2.1

Unnamed: 0,a,b,c,d
0,0,1,2.0,3.0
1,4,5,6.0,3.0
2,8,9,6.0,11.0


In [31]:
df

Unnamed: 0,a,b,c,d
0,0,1,2.0,3.0
1,4,5,6.0,
2,8,9,,11.0


In [32]:
df.fillna({'c': 100, 'd': df['d'].mean()})


Unnamed: 0,a,b,c,d
0,0,1,2.0,3.0
1,4,5,6.0,7.0
2,8,9,100.0,11.0
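
A short recap sketch of the cells above. It shows that `dropna` and `fillna` return new frames and leave `df` alone, which is why `df` keeps showing its `NaN`s between cells. Casting `c` and `d` to float before assigning `NaN` is a precaution added here, not something the cells above do.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=list('abcd'))
df = df.astype({'c': 'float64', 'd': 'float64'})  # make room for NaN
df.loc[1, 'd'] = np.nan
df.loc[2, 'c'] = np.nan

filled = df.fillna({'c': 100, 'd': df['d'].mean()})  # new DataFrame
still_missing = int(df.isna().sum().sum())           # original untouched

# Other variations on the same frame:
rows_kept = df.dropna()  # drop rows with any NaN (axis=0 is the default)
forward = df.ffill()     # carry the last valid value down each column
backward = df.bfill()    # fill from the next valid value instead
```

To keep a result, assign it back, e.g. `df = df.fillna(0)`.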
