# Cleaning and Transforming Data

## Bringing in Data

* `pd.read_csv('./DOHMH_Dog_Bite_Data.csv')`
    * lots of options
    * https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
* `read_excel`
* `read_json`
* etc. (see book)
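
A few of the "lots of options" in action. This is only a sketch: the three rows below are a hypothetical stand-in for the real file (read from an in-memory string so it runs anywhere), and the options shown are just the ones that tend to matter for messy columns.

```python
import io
import pandas as pd

# Hypothetical three-row extract standing in for the real CSV file.
raw = io.StringIO(
    "UniqueID,DateOfBite,Breed,Age,ZipCode\n"
    '1,January 02 2015,"Poodle, Standard",3,11238\n'
    "2,January 02 2015,HUSKY,,11249\n"
    "3,January 03 2015,,10M,\n"
)

bites = pd.read_csv(
    raw,
    index_col="UniqueID",    # use the id column as the row index
    dtype={"ZipCode": str},  # keep zip codes as text instead of floats
    nrows=1000,              # read at most this many rows (handy for big files)
)
print(bites)
```

Empty fields still come back as `NaN`, even in the columns read as text.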

## Initial Exploration

* show rows: `head(n)`, `tail(n)`, `sample(n)`
* metadata about columns: `dtypes` (`dtype` on a single Series), `count()`, `info()`
* summary statistics: `describe()`

## Working on Columns

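* what data do we have? `dtype`
* count values by Python type... but note: a missing value (`NaN`) is a `float`, even in a column of strings!
    * `map(lambda x: type(x))`, then `value_counts()`
* what are some actual values? `value_counts()`
* want to temporarily drop rows with a null in a column?
    * `tmp = df.dropna(subset=['Age'])` (plain `df.dropna()` drops rows with a null in *any* column)
* sort a Series by index:
    * `sort_index()`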
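
A minimal sketch of the checks listed above, on a hypothetical column (the values are made up, not taken from the data set):

```python
import numpy as np
import pandas as pd

# Hypothetical column mixing strings and missing values.
ages = pd.Series(["3", np.nan, "10M", "3", np.nan, "6"], name="Age", dtype=object)

# Python type of each element: the missing entries show up as float.
type_counts = ages.map(type).value_counts()

# value_counts() skips NaN by default; sort_index() orders by the values.
by_value = ages.value_counts().sort_index()

# Drop rows only when *this* column is null; the original frame is untouched.
frame = pd.DataFrame({"Age": ages, "Borough": ["Brooklyn"] * 6})
tmp = frame.dropna(subset=["Age"])

print(type_counts)
print(by_value)
print(len(frame), len(tmp))
```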

## Altering Display Options

* `pd.option_context('display.max_rows', 500)`

## Work on Specific Columns

btw, vectorized methods / accessors on a Series:

* use `.dt` or `.str`
* call methods from there

conversions/handling columns

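* numeric, but stored as object or str
    * `astype('float64')`... but!!!! it raises `ValueError` as soon as one value can't be parsed (e.g. `'7M'`)
    * `map` to apply an arbitrary function, e.g. one that calls `replace`
        * note!!!! pass `na_action='ignore'` so the function is never called on `NaN`
    * or write a more sophisticated normalizing function (this is tricky)
    * `pd.to_numeric(series, errors='coerce')` turns anything unparseable into `NaN`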
* date, `dt`
    * convert to datetime object
        * `pd.to_datetime(series, errors='coerce')`
    * test it out on a string first
        * `pd.to_datetime('January 02 2015')`
    * convert all date objects to month name: `.dt.month_name()`
    * convert all date objects to month number `.dt.month`
    * to graph... with month names on the axis
        * `import calendar`
        * `list(calendar.month_abbr)[1:]` (index 0 is `''` so that month numbers 1 to 12 line up; slice it off)
        * pass the month numbers and these labels to `xticks`
* `str`
    * `.strip()`
    * `.upper()`
    * `.split()`
        * `expand=True` to create a data frame
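
The conversions listed above, sketched end to end. All three input Series are hypothetical, and the explicit `format=` string is an assumption about how the dates are written.

```python
import calendar
import pandas as pd

# Hypothetical values; none of these come from the real data set.
prices = pd.Series(["$1100", "$284", None, "n/a"])
dates = pd.Series(["January 02 2015", "March 28 2016", "not a date"])
places = pd.Series(["Cupertino, US", "Suwon, South Korea"])

# map + na_action='ignore': the lambda never sees the missing value.
stripped = prices.map(lambda s: s.replace("$", ""), na_action="ignore")
# errors='coerce' turns the unparseable 'n/a' into NaN instead of raising.
numeric = pd.to_numeric(stripped, errors="coerce")

# Same idea for dates: strings that do not parse become NaT.
parsed = pd.to_datetime(dates, format="%B %d %Y", errors="coerce")
month_numbers = parsed.dt.month
month_names = parsed.dt.month_name()

# Labels for a plot axis: month_abbr[0] is '', so slice it off.
labels = list(calendar.month_abbr)[1:]

# .str methods chain; expand=True returns a DataFrame, one column per piece.
parts = places.str.split(",", expand=True)
country = parts[1].str.strip().str.upper()

print(numeric, month_names, country, sep="\n")
```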
    
## Data Set Questions


* what's the average age of the dogs in the data set? oldest? youngest?
    * let's check what the metadata has to say about this
    * https://data.cityofnewyork.us/Health/DOHMH-Dog-Bite-Data/rsgh-akpg
    * ah... but what are the units??? IDK!!!! 🤔
* when is the worst time of year for dog bites? (so I can go out in a bubble)
* which breeds, besides the obvious, are the bite-iest? (er, let's do 2nd and/or 3rd place)
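
One possible way to approach the "worst time of year" question is to count bites per calendar month. The dates below are hypothetical, so the answer it prints says nothing about the real data.

```python
import calendar
import pandas as pd

# Hypothetical dates, not taken from the real file.
dates = pd.Series(
    ["January 02 2015", "July 19 2016", "July 04 2017", "May 14 2017"],
    name="DateOfBite",
)
parsed = pd.to_datetime(dates, format="%B %d %Y", errors="coerce")

# Count per month number, then fill in the months that had no bites.
per_month = parsed.dt.month.value_counts().reindex(range(1, 13), fill_value=0)
per_month.index = list(calendar.month_abbr)[1:]  # 1..12 -> 'Jan'..'Dec'

worst = per_month.idxmax()
print(per_month)
print("most bites in:", worst)
```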

In [1]:
import pandas as pd
import numpy as np

In [2]:
d = [["$229.2", 2017, 123000, "$1100", "Cupertino, US"],
     ["$211.9", 2017, 320671, "$284", "Suwon, South Korea"],
     ["$177.8", 2017, 566000, "$985",  "Seattle, US"],
     ["$154.7", 2017, 1300000, "$66", "New Taipei City, Taiwan"],
     ["$110.8", 2017, 80110, "$834", "Mountain View, US"]]

comps = ["apple", "samsung", "amazon", "foxconn", "alphabet"]
cols = ["revenue", "fy", "employees", "mcap", "location"]

c = pd.DataFrame(d, index=comps, columns=cols)

In [5]:
c

Unnamed: 0,revenue,fy,employees,mcap,location
apple,$229.2,2017,123000,$1100,"Cupertino, US"
samsung,$211.9,2017,320671,$284,"Suwon, South Korea"
amazon,$177.8,2017,566000,$985,"Seattle, US"
foxconn,$154.7,2017,1300000,$66,"New Taipei City, Taiwan"
alphabet,$110.8,2017,80110,$834,"Mountain View, US"


In [6]:
del c['fy']

In [7]:
c

Unnamed: 0,revenue,employees,mcap,location
apple,$229.2,123000,$1100,"Cupertino, US"
samsung,$211.9,320671,$284,"Suwon, South Korea"
amazon,$177.8,566000,$985,"Seattle, US"
foxconn,$154.7,1300000,$66,"New Taipei City, Taiwan"
alphabet,$110.8,80110,$834,"Mountain View, US"


In [9]:
c.drop('mcap', axis=1)

Unnamed: 0,revenue,employees,location
apple,$229.2,123000,"Cupertino, US"
samsung,$211.9,320671,"Suwon, South Korea"
amazon,$177.8,566000,"Seattle, US"
foxconn,$154.7,1300000,"New Taipei City, Taiwan"
alphabet,$110.8,80110,"Mountain View, US"


In [10]:
c['employees']

apple        123000
samsung      320671
amazon       566000
foxconn     1300000
alphabet      80110
Name: employees, dtype: int64

In [11]:
c.loc['amazon', 'revenue']

'$177.8'

In [14]:
c[['revenue', 'location']]['apple':'amazon']

Unnamed: 0,revenue,location
apple,$229.2,"Cupertino, US"
samsung,$211.9,"Suwon, South Korea"
amazon,$177.8,"Seattle, US"
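
The chained selection above picks columns first and then slices rows by label. A single `.loc` call can express the same thing in one step. This sketch rebuilds a cut-down copy of `c` so that it runs on its own.

```python
import pandas as pd

# Cut-down copy of the frame used above.
c = pd.DataFrame(
    {
        "revenue": ["$229.2", "$211.9", "$177.8", "$154.7"],
        "employees": [123000, 320671, 566000, 1300000],
        "location": ["Cupertino, US", "Suwon, South Korea",
                     "Seattle, US", "New Taipei City, Taiwan"],
    },
    index=["apple", "samsung", "amazon", "foxconn"],
)

chained = c[["revenue", "location"]]["apple":"amazon"]
single = c.loc["apple":"amazon", ["revenue", "location"]]  # rows, then columns

print(single)
```

Label slices include both endpoints, which is why `amazon` appears in the result.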


In [15]:
c.index

Index(['apple', 'samsung', 'amazon', 'foxconn', 'alphabet'], dtype='object')

In [16]:
c

Unnamed: 0,revenue,employees,mcap,location
apple,$229.2,123000,$1100,"Cupertino, US"
samsung,$211.9,320671,$284,"Suwon, South Korea"
amazon,$177.8,566000,$985,"Seattle, US"
foxconn,$154.7,1300000,$66,"New Taipei City, Taiwan"
alphabet,$110.8,80110,$834,"Mountain View, US"


In [17]:
c['state'] = pd.Series(['CA', 'CA', 'WA'], ['apple', 'alphabet', 'amazon'])

In [18]:
c

Unnamed: 0,revenue,employees,mcap,location,state
apple,$229.2,123000,$1100,"Cupertino, US",CA
samsung,$211.9,320671,$284,"Suwon, South Korea",
amazon,$177.8,566000,$985,"Seattle, US",WA
foxconn,$154.7,1300000,$66,"New Taipei City, Taiwan",
alphabet,$110.8,80110,$834,"Mountain View, US",CA


In [19]:
np.nan

nan

In [20]:
type(np.nan)

float

In [21]:
c['employees']

apple        123000
samsung      320671
amazon       566000
foxconn     1300000
alphabet      80110
Name: employees, dtype: int64

In [22]:
c['employees'] // 100000

apple        1
samsung      3
amazon       5
foxconn     13
alphabet     0
Name: employees, dtype: int64

In [23]:
c[c['employees'] < 200000]

Unnamed: 0,revenue,employees,mcap,location,state
apple,$229.2,123000,$1100,"Cupertino, US",CA
alphabet,$110.8,80110,$834,"Mountain View, US",CA


In [24]:
c['state']

apple        CA
samsung     NaN
amazon       WA
foxconn     NaN
alphabet     CA
Name: state, dtype: object

In [26]:
c[c['state'].isnull()]

Unnamed: 0,revenue,employees,mcap,location,state
samsung,$211.9,320671,$284,"Suwon, South Korea",
foxconn,$154.7,1300000,$66,"New Taipei City, Taiwan",


In [28]:
c['state'] = c['state'].fillna('')

In [29]:
c

Unnamed: 0,revenue,employees,mcap,location,state
apple,$229.2,123000,$1100,"Cupertino, US",CA
samsung,$211.9,320671,$284,"Suwon, South Korea",
amazon,$177.8,566000,$985,"Seattle, US",WA
foxconn,$154.7,1300000,$66,"New Taipei City, Taiwan",
alphabet,$110.8,80110,$834,"Mountain View, US",CA


In [30]:
c['employees'].sum()

2389781

In [31]:
c['employees'].mean()

477956.2

In [32]:
c['location'].str.upper()

apple                 CUPERTINO, US
samsung          SUWON, SOUTH KOREA
amazon                  SEATTLE, US
foxconn     NEW TAIPEI CITY, TAIWAN
alphabet          MOUNTAIN VIEW, US
Name: location, dtype: object

In [35]:
tmp = c['location'].str.split(',',  expand=True)

In [36]:
c['country'] = tmp[1].str.strip()  # splitting on ',' leaves a leading space (' US')

In [37]:
c

Unnamed: 0,revenue,employees,mcap,location,state,country
apple,$229.2,123000,$1100,"Cupertino, US",CA,US
samsung,$211.9,320671,$284,"Suwon, South Korea",,South Korea
amazon,$177.8,566000,$985,"Seattle, US",WA,US
foxconn,$154.7,1300000,$66,"New Taipei City, Taiwan",,Taiwan
alphabet,$110.8,80110,$834,"Mountain View, US",CA,US


In [38]:
#c.reindex(list(c.index)[1:])

In [39]:
a = [1, 2, 3]
b = [4, 5, 6]


In [40]:
print(a)

[1, 2, 3]


In [41]:
print(*a)

1 2 3


In [42]:
[*a, 7, 8]

[1, 2, 3, 7, 8]

In [43]:
c.reindex([*list(c.index)[1:], 'apple'])

Unnamed: 0,revenue,employees,mcap,location,state,country
samsung,$211.9,320671,$284,"Suwon, South Korea",,South Korea
amazon,$177.8,566000,$985,"Seattle, US",WA,US
foxconn,$154.7,1300000,$66,"New Taipei City, Taiwan",,Taiwan
alphabet,$110.8,80110,$834,"Mountain View, US",CA,US
apple,$229.2,123000,$1100,"Cupertino, US",CA,US


In [45]:
"""
tmp = list(c.index)[1:]
tmp.append('apple')
c.reindex(tmp)
"""

"\ntmp = list(c.index)[1:]\ntmp.append('apple')\nc.reindex(tmp)\n"

In [46]:
rain = pd.DataFrame([[3.50, 4.53, 4.13, 3.98],
                     [7.91, 5.98, 6.10, 5.12],
                     [3.94, 5.28, 3.90, 4.49],
                     [1.42, 0.63, 0.75, 1.65]],
    index=['New York', 'New Orleans', 'Atlanta', 'Seattle'],
    columns=['Jun', 'Jul', 'Aug', 'Sept'])

In [47]:
rain

Unnamed: 0,Jun,Jul,Aug,Sept
New York,3.5,4.53,4.13,3.98
New Orleans,7.91,5.98,6.1,5.12
Atlanta,3.94,5.28,3.9,4.49
Seattle,1.42,0.63,0.75,1.65


In [48]:
rain.apply(lambda arg: type(arg))

Jun     <class 'pandas.core.series.Series'>
Jul     <class 'pandas.core.series.Series'>
Aug     <class 'pandas.core.series.Series'>
Sept    <class 'pandas.core.series.Series'>
dtype: object

In [49]:
rain.apply(lambda month: sum(month), axis=0)


Jun     16.77
Jul     16.42
Aug     14.88
Sept    15.24
dtype: float64

In [50]:
rain.apply(lambda month: sum(month), axis=1)

New York       16.14
New Orleans    25.11
Atlanta        17.61
Seattle         4.45
dtype: float64

In [51]:
rain.apply(lambda month: max(month) - min(month), axis=1)

New York       1.03
New Orleans    2.79
Atlanta        1.38
Seattle        1.02
dtype: float64

In [52]:
rain.sum()

Jun     16.77
Jul     16.42
Aug     14.88
Sept    15.24
dtype: float64

In [53]:
pd.Series(['ant', 'cat', 'bat']).map(lambda word: word + 's')

0    ants
1    cats
2    bats
dtype: object

## DataFrame

* apply ... call a function on every row or col
* applymap ... call a function on every element (pandas 2.1+ renames this to `DataFrame.map`)

## Series

* map ... call a function on every element
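
A small side-by-side of the two, using a cut-down copy of the `rain` frame. The element-wise step maps each column in turn, which is one way to avoid depending on whether the installed pandas calls it `applymap` or `DataFrame.map`.

```python
import pandas as pd

rain = pd.DataFrame(
    [[3.50, 4.53], [7.91, 5.98]],
    index=["New York", "New Orleans"],
    columns=["Jun", "Jul"],
)

# apply: the function receives a whole column (axis=0) or a whole row (axis=1).
col_range = rain.apply(lambda col: col.max() - col.min(), axis=0)
row_total = rain.apply(lambda row: row.sum(), axis=1)

# Series.map: the function receives one element at a time.
jun_labels = rain["Jun"].map(lambda inches: f"{inches:.1f} in")

# Element-wise over the whole frame: map every column.
rounded = rain.apply(lambda col: col.map(round))

print(col_range, row_total, jun_labels, rounded, sep="\n\n")
```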

In [56]:
help(rain.describe)

Help on method describe in module pandas.core.generic:

describe(percentiles=None, include=None, exclude=None) method of pandas.core.frame.DataFrame instance
    Generate descriptive statistics that summarize the central tendency,
    dispersion and shape of a dataset's distribution, excluding
    ``NaN`` values.
    
    Analyzes both numeric and object series, as well
    as ``DataFrame`` column sets of mixed data types. The output
    will vary depending on what is provided. Refer to the notes
    below for more detail.
    
    Parameters
    ----------
    percentiles : list-like of numbers, optional
        The percentiles to include in the output. All should
        fall between 0 and 1. The default is
        ``[.25, .5, .75]``, which returns the 25th, 50th, and
        75th percentiles.
    include : 'all', list-like of dtypes or None (default), optional
        A white list of data types to include in the result. Ignored
        for ``Series``. Here are the options:
    
    

In [57]:
names = ['a', 'b', 'c']

In [59]:
names[1:] + [names[0]]

['b', 'c', 'a']

In [60]:
[*list(names)[1:], 'a']

['b', 'c', 'a']

In [61]:
df = pd.DataFrame(np.arange(12).reshape((3, 4)), 
	columns=list('abcd'))
df.loc[1, 'd'] = np.nan
df.loc[2, 'c'] = np.nan

In [62]:
df

Unnamed: 0,a,b,c,d
0,0,1,2.0,3.0
1,4,5,6.0,
2,8,9,,11.0


In [63]:
df.dropna()

Unnamed: 0,a,b,c,d
0,0,1,2.0,3.0


In [64]:
df

Unnamed: 0,a,b,c,d
0,0,1,2.0,3.0
1,4,5,6.0,
2,8,9,,11.0


In [65]:
df.dropna(axis=1)

Unnamed: 0,a,b
0,0,1
1,4,5
2,8,9


In [66]:
df

Unnamed: 0,a,b,c,d
0,0,1,2.0,3.0
1,4,5,6.0,
2,8,9,,11.0


In [67]:
df['d'].notnull()

0     True
1    False
2     True
Name: d, dtype: bool

In [68]:

df[df['d'].notnull()]

Unnamed: 0,a,b,c,d
0,0,1,2.0,3.0
2,8,9,,11.0


In [69]:
df.fillna(-1)

Unnamed: 0,a,b,c,d
0,0,1,2.0,3.0
1,4,5,6.0,-1.0
2,8,9,-1.0,11.0


In [70]:
df.fillna({'c':100, 'd':234})

Unnamed: 0,a,b,c,d
0,0,1,2.0,3.0
1,4,5,6.0,234.0
2,8,9,100.0,11.0


In [71]:
df.fillna({'c':100, 'd':df['d'].mean()})

Unnamed: 0,a,b,c,d
0,0,1,2.0,3.0
1,4,5,6.0,7.0
2,8,9,100.0,11.0


In [72]:
df.ffill()  # same result as fillna(method='ffill'), which newer pandas versions deprecate

Unnamed: 0,a,b,c,d
0,0,1,2.0,3.0
1,4,5,6.0,3.0
2,8,9,6.0,11.0


In [74]:
df.dtypes

a      int64
b      int64
c    float64
d    float64
dtype: object

In [75]:
df

Unnamed: 0,a,b,c,d
0,0,1,2.0,3.0
1,4,5,6.0,
2,8,9,,11.0


In [78]:
df.count()

a    3
b    3
c    2
d    2
dtype: int64

In [79]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 4 columns):
a    3 non-null int64
b    3 non-null int64
c    2 non-null float64
d    2 non-null float64
dtypes: float64(2), int64(2)
memory usage: 224.0 bytes


In [80]:
df['a'].astype('float64')

0    0.0
1    4.0
2    8.0
Name: a, dtype: float64

In [81]:
df['d'].astype('float64')

0     3.0
1     NaN
2    11.0
Name: d, dtype: float64

In [116]:
df = pd.read_csv('./DOHMH_Dog_Bite_Data (1).csv')

In [83]:
df

Unnamed: 0,UniqueID,DateOfBite,Species,Breed,Age,Gender,SpayNeuter,Borough,ZipCode
0,1,January 02 2015,DOG,"Poodle, Standard",3,M,True,Brooklyn,11238
1,2,January 02 2015,DOG,HUSKY,,U,False,Brooklyn,11249
2,3,January 02 2015,DOG,,,U,False,Brooklyn,
3,4,January 01 2015,DOG,American Pit Bull Terrier/Pit Bull,6,M,False,Brooklyn,11221
4,5,January 03 2015,DOG,American Pit Bull Terrier/Pit Bull,1,M,False,Brooklyn,11207
...,...,...,...,...,...,...,...,...,...
10275,10276,December 24 2017,DOG,CHIWEENIE MIX,7,M,True,Staten Island,10303
10276,10277,December 24 2017,DOG,DUNKER,5,F,True,Staten Island,10303
10277,10278,December 21 2017,DOG,"Schnauzer, Miniature",10M,M,True,Staten Island,10312
10278,10279,December 28 2017,DOG,Mixed/Other,,F,False,Staten Island,10308


In [84]:
df.head(5)

Unnamed: 0,UniqueID,DateOfBite,Species,Breed,Age,Gender,SpayNeuter,Borough,ZipCode
0,1,January 02 2015,DOG,"Poodle, Standard",3.0,M,True,Brooklyn,11238.0
1,2,January 02 2015,DOG,HUSKY,,U,False,Brooklyn,11249.0
2,3,January 02 2015,DOG,,,U,False,Brooklyn,
3,4,January 01 2015,DOG,American Pit Bull Terrier/Pit Bull,6.0,M,False,Brooklyn,11221.0
4,5,January 03 2015,DOG,American Pit Bull Terrier/Pit Bull,1.0,M,False,Brooklyn,11207.0


In [85]:
df.tail(5)

Unnamed: 0,UniqueID,DateOfBite,Species,Breed,Age,Gender,SpayNeuter,Borough,ZipCode
10275,10276,December 24 2017,DOG,CHIWEENIE MIX,7,M,True,Staten Island,10303
10276,10277,December 24 2017,DOG,DUNKER,5,F,True,Staten Island,10303
10277,10278,December 21 2017,DOG,"Schnauzer, Miniature",10M,M,True,Staten Island,10312
10278,10279,December 28 2017,DOG,Mixed/Other,,F,False,Staten Island,10308
10279,10280,December 29 2017,DOG,BOXER/PIT BULL,,M,False,Staten Island,10314


In [86]:
df.sample(10)

Unnamed: 0,UniqueID,DateOfBite,Species,Breed,Age,Gender,SpayNeuter,Borough,ZipCode
8149,8150,July 19 2016,DOG,Pit Bull,,U,False,Queens,11694.0
8739,8740,May 14 2017,DOG,NEWFOUNDLAND OR TIBETAN MASTIFF,,U,False,Queens,11364.0
10247,10248,November 15 2017,DOG,Pointer,,U,False,Staten Island,
7841,7842,March 28 2016,DOG,"Poodle, Miniature",17.0,M,False,Queens,11411.0
1020,1021,May 20 2016,DOG,Mixed/Other,5.0,M,False,Brooklyn,11216.0
9859,9860,August 16 2016,DOG,Mixed/Other,,U,False,Staten Island,10314.0
1820,1821,June 13 2017,DOG,TERRIER,,U,False,Brooklyn,11216.0
8925,8926,May 16 2017,DOG,Pit Bull,2.0,F,True,Queens,11418.0
169,170,April 19 2015,DOG,ITALIAN MASTIFF,3.0,F,False,Brooklyn,11221.0
7750,7751,February 14 2016,DOG,Golden Retriever,2.0,M,True,Queens,11385.0


In [87]:
df.dtypes

UniqueID       int64
DateOfBite    object
Species       object
Breed         object
Age           object
Gender        object
SpayNeuter      bool
Borough       object
ZipCode       object
dtype: object

In [88]:
df.count()

UniqueID      10280
DateOfBite    10280
Species       10280
Breed          8692
Age            5534
Gender        10280
SpayNeuter    10280
Borough       10280
ZipCode        7613
dtype: int64

In [89]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10280 entries, 0 to 10279
Data columns (total 9 columns):
UniqueID      10280 non-null int64
DateOfBite    10280 non-null object
Species       10280 non-null object
Breed         8692 non-null object
Age           5534 non-null object
Gender        10280 non-null object
SpayNeuter    10280 non-null bool
Borough       10280 non-null object
ZipCode       7613 non-null object
dtypes: bool(1), int64(1), object(7)
memory usage: 652.7+ KB


In [90]:
df.describe()

Unnamed: 0,UniqueID
count,10280.0
mean,5140.5
std,2967.724718
min,1.0
25%,2570.75
50%,5140.5
75%,7710.25
max,10280.0


In [91]:
df['Age']

0          3
1        NaN
2        NaN
3          6
4          1
        ... 
10275      7
10276      5
10277    10M
10278    NaN
10279    NaN
Name: Age, Length: 10280, dtype: object

In [92]:
df['Age'].value_counts()

2                          835
3                          748
1                          665
4                          556
5                          514
                          ... 
2018-08-09T00:00:00.000      1
2018-04-02T00:00:00.000      1
10.5                         1
10y                          1
2018-09-10T00:00:00.000      1
Name: Age, Length: 153, dtype: int64

In [93]:
df['Age'].map(lambda age: type(age)).value_counts()

<class 'str'>      5534
<class 'float'>    4746
Name: Age, dtype: int64

In [94]:
df['Age'].isnull().value_counts()

False    5534
True     4746
Name: Age, dtype: int64

In [95]:
df['Age'].astype('float64')

ValueError: could not convert string to float: '7M'

In [96]:
df['Age'].value_counts()

2                          835
3                          748
1                          665
4                          556
5                          514
                          ... 
2018-08-09T00:00:00.000      1
2018-04-02T00:00:00.000      1
10.5                         1
10y                          1
2018-09-10T00:00:00.000      1
Name: Age, Length: 153, dtype: int64

In [98]:
with pd.option_context('display.max_rows', 200):
    print(df['Age'].value_counts())

2                          835
3                          748
1                          665
4                          556
5                          514
6                          386
7                          324
8                          286
9                          167
10                         159
11                         101
12                          69
13                          57
14                          33
8M                          32
4M                          31
10M                         30
11M                         28
2Y                          27
3M                          26
3Y                          24
15                          22
9M                          22
5M                          21
1Y                          20
7M                          20
6M                          18
4Y                          16
6Y                          15
5Y                          13
2M                          12
16                          12
10Y     

In [108]:
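def normalize_age(s):
    # NaN (a float) passes through untouched
    if isinstance(s, float):
        return s
    
    month_endings = ['M', 'MTHS', 'm']
    year_endings = ['Y', 'YRS']
    
    for ending in month_endings + year_endings:
        # strip the candidate suffix; keep going unless a bare number is left
        age = s.replace(ending, '').strip()
        if age.isnumeric():
            # caution: a value with no suffix (e.g. '3') is already numeric
            # on the first pass ('M'), so it is treated as months too
            if ending in month_endings:
                return str(int(age) / 12)
            elif ending in year_endings:
                return age
        
    return s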

In [109]:
normalize_age(np.nan)

nan

In [110]:

normalize_age('5m')

'0.4166666666666667'

In [111]:
df['Age'].map(normalize_age)


0                       0.25
1                        NaN
2                        NaN
3                        0.5
4        0.08333333333333333
                ...         
10275     0.5833333333333334
10276     0.4166666666666667
10277     0.8333333333333334
10278                    NaN
10279                    NaN
Name: Age, Length: 10280, dtype: object

In [117]:
df['Age'].map(normalize_age).value_counts()

0.16666666666666666        849
0.25                       783
0.08333333333333333        667
0.3333333333333333         596
0.4166666666666667         538
                          ... 
8WKS                         1
10y                          1
2018-08-09T00:00:00.000      1
2018-08-02T00:00:00.000      1
1.5833333333333333           1
Name: Age, Length: 104, dtype: int64

In [118]:
df['Age'] = df['Age'].map(normalize_age)

In [120]:
df['Age'].value_counts()

0.16666666666666666        849
0.25                       783
0.08333333333333333        667
0.3333333333333333         596
0.4166666666666667         538
                          ... 
8WKS                         1
10y                          1
2018-08-09T00:00:00.000      1
2018-08-02T00:00:00.000      1
1.5833333333333333           1
Name: Age, Length: 104, dtype: int64

In [124]:
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')

In [125]:
df.describe()

Unnamed: 0,UniqueID,Age
count,10280.0,5467.0
mean,5140.5,0.56659
std,2967.724718,1.106275
min,1.0,0.083333
25%,2570.75,0.166667
50%,5140.5,0.333333
75%,7710.25,0.583333
max,10280.0,15.5
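
In the recorded output above an age of `3` comes out as `0.25` and the median is about 0.33, which suggests that values with no unit suffix were divided by 12 along with the month values. Here is a sketch of a variant that checks the suffix before stripping it. The name `normalize_age_checked`, the suffix tuples, and above all the assumption that a bare number means years are mine, not the data set's.

```python
import numpy as np
import pandas as pd

MONTH_ENDINGS = ("MTHS", "M")  # hypothetical suffix lists
YEAR_ENDINGS = ("YRS", "Y")

def normalize_age_checked(s):
    """Return the age in years as a float, or NaN if the value is not understood."""
    if not isinstance(s, str):
        return s  # NaN passes through
    text = s.strip().upper()
    for ending in MONTH_ENDINGS:
        if text.endswith(ending):
            number = text[: -len(ending)].strip()
            return int(number) / 12 if number.isdigit() else np.nan
    for ending in YEAR_ENDINGS:
        if text.endswith(ending):
            number = text[: -len(ending)].strip()
            return float(number) if number.isdigit() else np.nan
    # No suffix: ASSUMPTION that a bare number is already in years.
    try:
        return float(text)
    except ValueError:
        return np.nan

# Hypothetical sample covering the spellings seen in value_counts().
ages = pd.Series(["3", "10M", "2Y", "10y", "8WKS", np.nan, "10.5"])
years = ages.map(normalize_age_checked)
print(years)
```

Anything the function cannot interpret (weeks, stray timestamps) becomes `NaN` rather than being guessed at.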


In [126]:
df['Breed'].value_counts()

Pit Bull                                1921
Shih Tzu                                 364
American Pit Bull Terrier/Pit Bull       349
Chihuahua                                344
American Pit Bull Mix / Pit Bull Mix     340
                                        ... 
WEST HIGHLAND TERR                         1
ROTWEILLER/LABRADOR RETRIEVER              1
CHOW/SHEPHERD X                            1
SIBERIAN PIT BULL                          1
ROTTWEILER/BOXER                           1
Name: Breed, Length: 1032, dtype: int64
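
The 1032 distinct values above include the same breed spelled in different cases, so counts are split across variants. A minimal sketch of normalizing before counting, on a hypothetical sample (it does not merge genuinely different labels such as mixes or misspellings):

```python
import pandas as pd

# Hypothetical sample; the spellings mimic the kind of variation shown above.
breeds = pd.Series(
    ["Pit Bull", "PIT BULL", " pit bull ", "Shih Tzu", "SHIH TZU", "Chihuahua", None]
)

# Normalize surrounding whitespace and case, then count.
cleaned = breeds.str.strip().str.upper()
top = cleaned.value_counts()

runner_up = top.index[1]  # 2nd place, skipping "the obvious"
print(top)
print("runner-up:", runner_up)
```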