# Welcome to the Dark Art of Coding:
## Introduction to Python
Data Handling

<img src='../images/dark_art_logo.600px.png' width='300' style="float:right">

# Objectives
---

In this session, students should expect to:

* Merge DataFrames effectively
* Unstack your data
* Replace unwanted data with better versions

# Merging data
---

In [2]:
import pandas as pd
from pandas import DataFrame, Series

We start with two data sets. One is shorter than the other, but they are otherwise similar: both have a column of reader names and a column of country codes.

In [3]:
readers1 = pd.read_csv('reader_stats.csv')
readers2 = pd.read_csv('reader_stats_short.csv')

print("readers1 data:", '\n', readers1)
print('-' * 40)
print("readers2 data:", '\n', readers2)

readers1 data: 
     reader country
0   claude      fi
1   pierre      jp
2  vincent      hk
3    henri      hk
4    lilla      fi
5      eva      fi
6     anna      fi
7     olga      zw
----------------------------------------
readers2 data: 
     reader country
0   claude      hk
1   pierre      jp
2  vincent      jp
3    marie      py


Built-in pandas functions let us merge two DataFrames in several different ways. These merges are similar to the joins you might see in a SQL database.

In [4]:
readerso = pd.merge(readers1, readers2, how='outer')
readersi = pd.merge(readers1, readers2, how='inner')
readersl = pd.merge(readers1, readers2, how='left')
readersr = pd.merge(readers1, readers2, how='right')

<img src='Base.jpg' width='300' style='float:center'>

In [5]:
print('Outer Join\n')
readerso

Outer Join



Unnamed: 0,reader,country
0,claude,fi
1,pierre,jp
2,vincent,hk
3,henri,hk
4,lilla,fi
5,eva,fi
6,anna,fi
7,olga,zw
8,claude,hk
9,vincent,jp


<img src='Outer.jpg'  width='300' style='float:center'>

In [6]:
print('Inner Join\n')
readersi

Inner Join



Unnamed: 0,reader,country
0,pierre,jp


<img src='Inner.jpg'  width='300' style='float:center'>

In [7]:
print('Left Join\n')
readersl

Left Join



Unnamed: 0,reader,country
0,claude,fi
1,pierre,jp
2,vincent,hk
3,henri,hk
4,lilla,fi
5,eva,fi
6,anna,fi
7,olga,zw


<img src='Left.jpg'  width='300' style='float:center'>

In [8]:
print('Right Join\n')
readersr

Right Join



Unnamed: 0,reader,country
0,pierre,jp
1,claude,hk
2,vincent,jp
3,marie,py


<img src='Right.jpg'  width='300' style='float:center'>

NOTE: Unless you specify otherwise, `pd.merge()` joins on every column name the two DataFrames have in common. Here both columns are shared, so rows only match when the entire row is identical. In many cases, we simply want to join based on the contents of one or more columns.
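
As a minimal sketch of joining on a single column, the reader data shown above can be merged on `reader` alone. The DataFrames are rebuilt inline here instead of being read from the csv files, and the `suffixes` values are arbitrary labels chosen for this illustration.

```python
import pandas as pd

# Same data as reader_stats.csv and reader_stats_short.csv, built inline
readers1 = pd.DataFrame({
    'reader': ['claude', 'pierre', 'vincent', 'henri', 'lilla', 'eva', 'anna', 'olga'],
    'country': ['fi', 'jp', 'hk', 'hk', 'fi', 'fi', 'fi', 'zw'],
})
readers2 = pd.DataFrame({
    'reader': ['claude', 'pierre', 'vincent', 'marie'],
    'country': ['hk', 'jp', 'jp', 'py'],
})

# Join on the reader column only; the two country columns are kept
# side by side and told apart by the suffixes
by_reader = pd.merge(readers1, readers2, on='reader', how='inner',
                     suffixes=('_1', '_2'))
print(by_reader)
```

With `on='reader'`, claude and vincent now match even though their countries differ between the two files.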

# Key columns
Remember, DataFrames can be built from dictionaries: each key becomes a column name, and the sequence stored as that key's value supplies the elements of that column.

Here, we are creating some **key** columns that we can use to create joins...

In [9]:
dfa = DataFrame({'key':     ['bruce', 'bruce', 'diana', 'bruce', 'hal', 'diana', 'kara'],
                 'emails_left': [112, 111, 201, 109, 113, 203, 204]}) 

dfb = DataFrame({'key':        ['hal', 'bruce', 'selina', 'diana'],
                 'ages_right': [36, 37, 33, 34]})

In [10]:
# Imagine that, using the previous data, we wanted to analyze emails versus
# age (i.e. whether age impacts the number of emails someone receives over time).
# Let's start with a Left Join:

dfl = pd.merge(dfa, dfb, on='key', how='left')
dfl

Unnamed: 0,emails_left,key,ages_right
0,112,bruce,37.0
1,111,bruce,37.0
2,201,diana,34.0
3,109,bruce,37.0
4,113,hal,36.0
5,203,diana,34.0
6,204,kara,


In [11]:
# Now, let's look at an Inner Join:

dfi = pd.merge(dfa, dfb, on='key', how='inner')

In [12]:
dfi

Unnamed: 0,emails_left,key,ages_right
0,112,bruce,37
1,111,bruce,37
2,109,bruce,37
3,201,diana,34
4,203,diana,34
5,113,hal,36
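
When a join drops or blanks rows, it can help to see which keys failed to match. The sketch below is one possible way to check, using the `indicator` parameter of `pd.merge()` on the same `dfa` and `dfb` data; the variable names `dfo` and `unmatched` are made up for this illustration.

```python
import pandas as pd

dfa = pd.DataFrame({'key': ['bruce', 'bruce', 'diana', 'bruce', 'hal', 'diana', 'kara'],
                    'emails_left': [112, 111, 201, 109, 113, 203, 204]})
dfb = pd.DataFrame({'key': ['hal', 'bruce', 'selina', 'diana'],
                    'ages_right': [36, 37, 33, 34]})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
dfo = pd.merge(dfa, dfb, on='key', how='outer', indicator=True)

# Keys that appear in only one of the two DataFrames
unmatched = dfo[dfo['_merge'] != 'both']
print(unmatched[['key', '_merge']])
```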


## Multiple key columns

In [13]:
# Here, again, we create a set of DataFrames based on dictionaries.
# This time we choose to use more than one column that will be used as keys to
# match data in each of the DataFrames.


dfa = DataFrame({'fname_key': ['bruce', 'bruce', 'hal', 'selina', 'hal'],
                 'lname_key': ['wayne', 'jordan', 'wayne', 'kyle', 'jordan'],
                 'ages_left': [37, 53, 54, 33, 36]})

dfb = DataFrame({'fname_key': ['hal', 'bruce', 'hal', 'kara', 'hal'],
                 'lname_key': ['jordan', 'wayne', 'jordan', 'zor-el', 'jordan'],
                 'emails_right': [189, 111, 193, 253, 187]})

# Outer Join 
dfo = pd.merge(dfa, dfb, on=['fname_key', 'lname_key'], how='outer')
dfo

Unnamed: 0,ages_left,fname_key,lname_key,emails_right
0,37.0,bruce,wayne,111.0
1,53.0,bruce,jordan,
2,54.0,hal,wayne,
3,33.0,selina,kyle,
4,36.0,hal,jordan,189.0
5,36.0,hal,jordan,193.0
6,36.0,hal,jordan,187.0
7,,kara,zor-el,253.0


In [14]:
# Inner Join
dfi = pd.merge(dfa, dfb, on=['fname_key', 'lname_key'], how='inner')
dfi

Unnamed: 0,ages_left,fname_key,lname_key,emails_right
0,37,bruce,wayne,111
1,36,hal,jordan,189
2,36,hal,jordan,193
3,36,hal,jordan,187
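
Notice that the single `hal jordan` row in `dfa` matched three rows in `dfb`. If repeated keys would be a surprise in your own data, the `validate` parameter of `pd.merge()` can check the relationship for you. This is a sketch under that assumption; the flag name `one_to_one_ok` is invented for the example.

```python
import pandas as pd

dfa = pd.DataFrame({'fname_key': ['bruce', 'bruce', 'hal', 'selina', 'hal'],
                    'lname_key': ['wayne', 'jordan', 'wayne', 'kyle', 'jordan'],
                    'ages_left': [37, 53, 54, 33, 36]})
dfb = pd.DataFrame({'fname_key': ['hal', 'bruce', 'hal', 'kara', 'hal'],
                    'lname_key': ['jordan', 'wayne', 'jordan', 'zor-el', 'jordan'],
                    'emails_right': [189, 111, 193, 253, 187]})
keys = ['fname_key', 'lname_key']

# Each key pair is unique in dfa, so one_to_many passes
checked = pd.merge(dfa, dfb, on=keys, how='inner', validate='one_to_many')

# ('hal', 'jordan') repeats in dfb, so one_to_one is rejected
try:
    pd.merge(dfa, dfb, on=keys, how='inner', validate='one_to_one')
    one_to_one_ok = True
except pd.errors.MergeError as err:
    one_to_one_ok = False
    print('not one-to-one:', err)
```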


# Experience Points!
---

In your **text editor** create a simple script called:

```bash
my_merge_01.py
```

Execute your script in the **IPython interpreter** using the command:

```bash
run my_merge_01.py
```

Your script should do the following:
* Read in two csv files (don't worry about column names; the files have a header row that is turned into column names for you):
    * `left_file.csv`
    * `right_file.csv`
* Merge the two DataFrames using the `name` column as the key and using an inner join
* Create a new column called `matchip` of True/False values where the `toip` column and the `fmip` column match
* Output just the rows where `matchip` is True

When you complete this exercise, please put your green post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../images/green_sticky.300px.png' width='200' style='float:left'>

In [15]:
dfl = pd.read_csv('left_file.csv')
dfr = pd.read_csv('right_file.csv')
comb = pd.merge(dfl, dfr, on='name', how='inner')
comb['matchip'] = comb.fmip == comb.toip
comb[comb.matchip]

Unnamed: 0,name,toip,datetime_x,long,payload,email,fmip,datetime_y,lat,matchip
8,barbara gordon,102.86.56.199,2015-09-08T11:19:08,9.238479,410269,bgordon@jleague.org,102.86.56.199,2016-01-12T00:08:31,42.208347,True
10,dick grayson,75.122.133.75,2015-12-10T02:50:03,22.926502,837516,dgrayson@jleague.org,75.122.133.75,2016-01-29T22:26:49,48.263759,True
23,victor stone,106.152.114.248,2015-12-06T02:56:46,16.827302,971396,vstone@jleague.org,106.152.114.248,2015-11-28T03:24:52,37.282635,True


# Concatenation
---

Pandas Series and DataFrames (like some of the other data we've handled) can be concatenated. However, instead of using `+` as you would with lists or strings, you have to use the pandas built-in function `pd.concat()`. The default behaviour is to stack the data end to end.

In [16]:
names1 = Series(['wayne', 'jordan'], index=[1, 2])
names2 = Series(['dinah', 'kent'], index=[1, 2])
names3 = Series(['rayner', 'gordon', 'grayson'], index=[6, 7, 8])

pd.concat([names1, names3, names2], axis=0)

# An alternate method is to stack columns side by side
# pd.concat([names1, names3, names2], axis= 1)

1      wayne
2     jordan
6     rayner
7     gordon
8    grayson
1      dinah
2       kent
dtype: object
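
Notice that the labels 1 and 2 each appear twice in the result above. The sketch below shows two options `pd.concat()` offers for that situation, `verify_integrity` and `ignore_index`; the names `has_overlap` and `renumbered` are invented for the example.

```python
import pandas as pd

names1 = pd.Series(['wayne', 'jordan'], index=[1, 2])
names2 = pd.Series(['dinah', 'kent'], index=[1, 2])

# names1 and names2 share the labels 1 and 2, so the stacked result
# has duplicate index labels; verify_integrity=True refuses to build it
try:
    pd.concat([names1, names2], verify_integrity=True)
    has_overlap = False
except ValueError as err:
    has_overlap = True
    print(err)

# ignore_index=True discards the old labels and renumbers from 0
renumbered = pd.concat([names1, names2], ignore_index=True)
print(renumbered)
```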

In [17]:
names4 = pd.concat([names1, names3])
pd.concat([names1, names4], axis=1)

Unnamed: 0,0,1
1,wayne,wayne
2,jordan,jordan
6,,rayner
7,,gordon
8,,grayson


In [18]:
output = pd.concat([names1, names3, names3], keys=['rho', 'sigma', 'tau'])
output

rho    1      wayne
       2     jordan
sigma  6     rayner
       7     gordon
       8    grayson
tau    6     rayner
       7     gordon
       8    grayson
dtype: object

In [19]:
output = pd.concat([names1, names3, names3], axis=1, keys=['rho', 'sigma', 'tau'])
output

Unnamed: 0,rho,sigma,tau
1,wayne,,
2,jordan,,
6,,rayner,rayner
7,,gordon,gordon
8,,grayson,grayson


In [20]:
# To prep our next data set, we'll use yet another way to generate DataFrames:
# nested lists, each of which becomes a row in the DataFrame.
# As a reminder, you can assign column names when you generate the DataFrame.
# If you have no need for the original indexes, pass ignore_index=True and
# pandas will auto-generate a brand-new index on the fly when you do a
# concatenation.

dfa = DataFrame([[11, 21, 31, 41],
                 [13, 25, 32, 49],
                 [11, 21, 31, 41],
                 [11, 21, 31, 42]], columns=['iota', 'kappa', 'lambda', 'mu'])

dfb = DataFrame([[55, 66, 77],
                 [53, 63, 73]], columns=['kappa', 'lambda', 'mu'])

print(dfa)
print(dfb)

kevin = pd.concat([dfa, dfb], ignore_index=True)

   iota  kappa  lambda  mu
0    11     21      31  41
1    13     25      32  49
2    11     21      31  41
3    11     21      31  42
   kappa  lambda  mu
0     55      66  77
1     53      63  73


In [21]:
kevin
kevin.dtypes


iota      float64
kappa       int64
lambda      int64
mu          int64
dtype: object
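
The `iota` column became `float64` because `dfb` has no `iota` column, so its rows are filled with `NaN`, which is a floating point value. As a sketch of one alternative, `join='inner'` keeps only the columns that every input shares. The two-row `dfa` here is a shortened stand-in for the one above.

```python
import pandas as pd

dfa = pd.DataFrame([[11, 21, 31, 41],
                    [13, 25, 32, 49]], columns=['iota', 'kappa', 'lambda', 'mu'])
dfb = pd.DataFrame([[55, 66, 77],
                    [53, 63, 73]], columns=['kappa', 'lambda', 'mu'])

# Default (join='outer'): dfb has no iota column, so its rows get NaN there
# and the whole column is stored as float64
outer = pd.concat([dfa, dfb], ignore_index=True)
print(outer['iota'])

# join='inner' keeps only the columns every input shares, so no NaN appears
shared = pd.concat([dfa, dfb], ignore_index=True, join='inner')
print(shared.dtypes)
```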

# Unstacking
---

In [22]:
# When generating DataFrames, another common method, especially with ranges of
# data OR with randomized data is to use functions in numpy to seed
# the Frame with ranges and/or randomized values. 
# Here, we are creating a Frame with the numbers 100 to 114 and shaping it to be a 
# three by five table.

import numpy as np

df = DataFrame(np.arange(100, 115).reshape((3, 5)),
               index=pd.Index(['kara', 'dinah', 'selina'], name='justiceleague'),
               columns=pd.Index(['wed', 'thu', 'fri', 'sat', 'sun'], name='day'))
df

day            wed  thu  fri  sat  sun
justiceleague
kara           100  101  102  103  104
dinah          105  106  107  108  109
selina         110  111  112  113  114


In [23]:
# The default level to unstack is the innermost
df.unstack()

day  justiceleague
wed  kara             100
     dinah            105
     selina           110
thu  kara             101
     dinah            106
     selina           111
fri  kara             102
     dinah            107
     selina           112
sat  kara             103
     dinah            108
     selina           113
sun  kara             104
     dinah            109
     selina           114
dtype: int64

In [24]:
s = df.unstack()
s['wed']

justiceleague
kara      100
dinah     105
selina    110
dtype: int64
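
The counterpart of `unstack()` is `stack()`, which moves column labels into the index. The following is a small sketch of the round trip on the same `df`; the names `stacked` and `roundtrip` are chosen just for this illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(100, 115).reshape((3, 5)),
                  index=pd.Index(['kara', 'dinah', 'selina'], name='justiceleague'),
                  columns=pd.Index(['wed', 'thu', 'fri', 'sat', 'sun'], name='day'))

# stack() moves the column labels into the index, giving a Series
# keyed by (justiceleague, day)
stacked = df.stack()
print(stacked.loc[('dinah', 'fri')])

# unstack() pivots the innermost index level (day) back into columns
roundtrip = stacked.unstack()
print(roundtrip)
```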

In [25]:
# You can refer to the level to unstack by an integer number, starting
# with the farthest left being noted as 0. By default, pandas unstacks from the 
# innermost level of a multi-level hierarchical index.

# The following code comes directly from the pandas documentation:
# http://pandas.pydata.org/pandas-docs/stable/advanced.html
# Several take-aways for this code... 
#     * use the documentation > plenty of great examples are in there.
#     * This MultiIndex dataframe is a nice setup for demoing multilevel unstacking
 
'''
In [1]: arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
   ...:           ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
   ...: 
In [2]: tuples = list(zip(*arrays))
In [3]: tuples
Out[3]: 
[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]
In [4]: index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
In [5]: index
Out[5]: 
MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])
In [6]: s = pd.Series(np.random.randn(8), index=index)
In [7]: s
Out[7]: 
first  second
bar    one       0.469112
       two      -0.282863
baz    one      -1.509059
       two      -1.135632
foo    one       1.212112
       two      -0.173215
qux    one       0.119209
       two      -1.044236
dtype: float64
'''

In [26]:
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]

tuples = list(zip(*arrays))

In [27]:
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
index

MultiIndex(levels=[['bar', 'baz', 'foo', 'qux'], ['one', 'two']],
           labels=[[0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 0, 1, 0, 1, 0, 1]],
           names=['first', 'second'])

In [28]:
s = pd.Series(np.random.randn(8), index=index)

In [29]:
# Using the example above, it is possible to demonstrate several levels of unstacking.
# As noted, the default level of unstacking is to unstack from the innermost level
# of a MultiIndex. Levels are numbered starting with the outermost level as 0 and
# incrementing as they move inward.

s

first  second
bar    one      -1.548336
       two       0.481863
baz    one       0.105158
       two      -0.702698
foo    one       1.066325
       two      -1.308268
qux    one      -1.636488
       two       0.849723
dtype: float64

In [30]:
s.unstack(1)

second       one       two
first
bar    -1.548336  0.481863
baz     0.105158 -0.702698
foo     1.066325 -1.308268
qux    -1.636488  0.849723


In [31]:
s.unstack(0)

first        bar       baz       foo       qux
second
one    -1.548336  0.105158  1.066325 -1.636488
two     0.481863 -0.702698 -1.308268  0.849723


In [32]:
# NOTE: You can refer to the level to unstack by the name of the Index.

s.unstack('second')

second       one       two
first
bar    -1.548336  0.481863
baz     0.105158 -0.702698
foo     1.066325 -1.308268
qux    -1.636488  0.849723
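
In the example above every `first`/`second` combination exists. When one is absent, `unstack()` leaves a `NaN` in that cell. The sketch below uses a small hypothetical Series (`counts`, not part of the data above) to show this, along with the `fill_value` parameter.

```python
import pandas as pd

# Hypothetical Series in which the ('baz', 'two') combination is absent
index = pd.MultiIndex.from_tuples(
    [('bar', 'one'), ('bar', 'two'), ('baz', 'one')],
    names=['first', 'second'])
counts = pd.Series([1, 2, 3], index=index)

# The missing combination shows up as NaN, forcing the column to float
wide = counts.unstack('second')

# fill_value supplies a replacement for combinations that are absent
wide_filled = counts.unstack('second', fill_value=0)
print(wide_filled)
```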


# Pivot table
---

In [33]:
# Another great tool for looking at your data in more convenient ways is to use a 
# pivot table. Let's start with a DataFrame that has three columns based on 
# this list of lists. A timestamp, a Justice League hero and the number of 
# Tweets they received on a given day.

league = DataFrame([['2016-03-10T00:00:00', 'jordan', 221],
                    ['2016-03-10T00:00:00', 'wayne', 222],
                    ['2016-03-10T00:00:00', 'kyle', 345],
                    ['2016-03-11T00:00:00', 'jordan', 222],
                    ['2016-03-11T00:00:00', 'wayne', 223],
                    ['2016-03-11T00:00:00', 'kyle', 323],
                    ['2016-03-12T00:00:00', 'jordan', 201],
                    ['2016-03-12T00:00:00', 'wayne', 209],
                    ['2016-03-12T00:00:00', 'kyle', 340],
                    ['2016-03-13T00:00:00', 'jordan', 220],
                    ['2016-03-13T00:00:00', 'wayne', 223],
                    ['2016-03-13T00:00:00', 'kyle', 339],
                    ['2016-03-14T00:00:00', 'jordan', 201],
                    ['2016-03-14T00:00:00', 'wayne', 219],
                    ['2016-03-14T00:00:00', 'kyle', 345]],
                    columns=['timestamp', 'jleague', 'tweets'])

In [34]:
# From the league DataFrame, we can create a pivot table using the pivot() method:

tweet_view = league.pivot(index='timestamp', columns='jleague', values='tweets')
tweet_view

jleague              jordan  kyle  wayne
timestamp
2016-03-10T00:00:00     221   345    222
2016-03-11T00:00:00     222   323    223
2016-03-12T00:00:00     201   340    209
2016-03-13T00:00:00     220   339    223
2016-03-14T00:00:00     201   345    219
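
`pivot()` works here because each timestamp/hero pair appears exactly once. The sketch below uses a small made-up data set with a repeated pair to show what happens otherwise, and how `pivot_table()` with an aggregation function (here `'sum'`, chosen arbitrarily) can handle it.

```python
import pandas as pd

# Two readings for jordan on the same day (illustrative data)
league = pd.DataFrame([['2016-03-10T00:00:00', 'jordan', 221],
                       ['2016-03-10T00:00:00', 'jordan', 9],
                       ['2016-03-10T00:00:00', 'wayne', 222]],
                      columns=['timestamp', 'jleague', 'tweets'])

# pivot() cannot place two values in one cell, so it raises ValueError
try:
    league.pivot(index='timestamp', columns='jleague', values='tweets')
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print(err)

# pivot_table() combines the duplicates with an aggregation function
totals = league.pivot_table(index='timestamp', columns='jleague',
                            values='tweets', aggfunc='sum')
print(totals)
```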


In [35]:
league['fan_index'] = abs(np.random.randn(len(league)))
league

Unnamed: 0,timestamp,jleague,tweets,fan_index
0,2016-03-10T00:00:00,jordan,221,0.450131
1,2016-03-10T00:00:00,wayne,222,0.253846
2,2016-03-10T00:00:00,kyle,345,1.300827
3,2016-03-11T00:00:00,jordan,222,0.334536
4,2016-03-11T00:00:00,wayne,223,1.814572
5,2016-03-11T00:00:00,kyle,323,2.028574
6,2016-03-12T00:00:00,jordan,201,0.384711
7,2016-03-12T00:00:00,wayne,209,0.652716
8,2016-03-12T00:00:00,kyle,340,0.004608
9,2016-03-13T00:00:00,jordan,220,0.343874


In [36]:
tweet_view2 = league.pivot(index='timestamp', columns='jleague')
tweet_view2

                    tweets             fan_index
jleague             jordan kyle wayne    jordan      kyle     wayne
timestamp
2016-03-10T00:00:00    221  345   222  0.450131  1.300827  0.253846
2016-03-11T00:00:00    222  323   223  0.334536  2.028574  1.814572
2016-03-12T00:00:00    201  340   209  0.384711  0.004608  0.652716
2016-03-13T00:00:00    220  339   223  0.343874  2.501012  1.492216
2016-03-14T00:00:00    201  345   219  0.695901  0.572281  2.925244


In [37]:
tweet_view2['fan_index']

jleague                jordan      kyle     wayne
timestamp
2016-03-10T00:00:00  0.450131  1.300827  0.253846
2016-03-11T00:00:00  0.334536  2.028574  1.814572
2016-03-12T00:00:00  0.384711  0.004608  0.652716
2016-03-13T00:00:00  0.343874  2.501012  1.492216
2016-03-14T00:00:00  0.695901  0.572281  2.925244


# Removing duplicates and replacing values
---

In [38]:
# Dropping duplicates
# Work on a copy so that adding a column leaves dfa untouched
dfd = dfa.copy()
dfd['zeta'] = [4, 1, 4, 1]
dfd

Unnamed: 0,iota,kappa,lambda,mu,zeta
0,11,21,31,41,4
1,13,25,32,49,1
2,11,21,31,41,4
3,11,21,31,42,1


In [39]:
dfd.duplicated()

0    False
1    False
2     True
3    False
dtype: bool

In [40]:
dfd.duplicated(['iota', 'kappa'])

0    False
1    False
2     True
3     True
dtype: bool

In [41]:
dfd.drop_duplicates(['iota', 'kappa'])

Unnamed: 0,iota,kappa,lambda,mu,zeta
0,11,21,31,41,4
1,13,25,32,49,1
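
By default `drop_duplicates()` keeps the first row of each duplicate group. As a short sketch, the `keep` parameter changes that choice; the frame is rebuilt inline and `last_rows` is a name picked for this example.

```python
import pandas as pd

dfd = pd.DataFrame([[11, 21, 31, 41, 4],
                    [13, 25, 32, 49, 1],
                    [11, 21, 31, 41, 4],
                    [11, 21, 31, 42, 1]],
                   columns=['iota', 'kappa', 'lambda', 'mu', 'zeta'])

# keep='last' retains the final row of each duplicate group
last_rows = dfd.drop_duplicates(['iota', 'kappa'], keep='last')
print(last_rows)

# The call returned a new DataFrame; dfd still holds all four rows
print(len(dfd))
```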


In [42]:
dfd.drop_duplicates?

In [43]:
# Using .map()

# legend:
# 0 = 'm'
# 1 = 'f'

genders = {'selina kyle': '1',
           'bruce wayne': '0',
           'dinah lance': '1',
           'hal jordan': '0',
           'clark kent': '0',
           'barry allen': '0',
           'arthur curry': '0',
           'billy batson': '0',
           'barbara gordon': '1',
           'kara zor-el': '1',
           'john jones': '0',
           'diana prince': '1',
           'dick grayson': '0',
           'victor stone': '0',
           'ray palmer': '0',
           'john constantine': '0',
           'kyle rayner': '0',
           'wally west': '0'}


it = pd.read_csv('ig_tweets.csv')
it

Unnamed: 0,jleague,igs,tweets
0,billy batson,7,6
1,barbara GORDON,3,8
2,barbara gordon,9,5
3,john constantiNe,4,6
4,dinah lance,7,7
5,selina kyle,4,3
6,diana prince,6,9
7,selina kyle,3,11
8,arthur curry,8,10
9,Selina Kyle,6,14


In [44]:
# Uses a dictionary to map keys to values

it['gender'] = it['jleague'].map(genders)

In [45]:
it

Unnamed: 0,jleague,igs,tweets,gender
0,billy batson,7,6,0.0
1,barbara GORDON,3,8,
2,barbara gordon,9,5,1.0
3,john constantiNe,4,6,
4,dinah lance,7,7,1.0
5,selina kyle,4,3,1.0
6,diana prince,6,9,1.0
7,selina kyle,3,11,1.0
8,arthur curry,8,10,0.0
9,Selina Kyle,6,14,


In [46]:
# Run a function on every element of the Series using apply

it['jleagueLower'] = it['jleague'].apply(str.lower)
it

Unnamed: 0,jleague,igs,tweets,gender,jleagueLower
0,billy batson,7,6,0.0,billy batson
1,barbara GORDON,3,8,,barbara gordon
2,barbara gordon,9,5,1.0,barbara gordon
3,john constantiNe,4,6,,john constantine
4,dinah lance,7,7,1.0,dinah lance
5,selina kyle,4,3,1.0,selina kyle
6,diana prince,6,9,1.0,diana prince
7,selina kyle,3,11,1.0,selina kyle
8,arthur curry,8,10,0.0,arthur curry
9,Selina Kyle,6,14,,selina kyle
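
For string clean-up specifically, pandas also offers the `.str` accessor. The sketch below compares it with `apply(str.lower)` on a small made-up Series that contains a missing value, which is an assumption about what messier data might look like.

```python
import pandas as pd

# Illustrative names, including one missing entry
names = pd.Series(['barbara GORDON', 'Selina Kyle', None])

# The .str accessor lowercases each string and passes missing values through
lowered = names.str.lower()
print(lowered)

# apply(str.lower) calls str.lower on every element, including the missing one
try:
    names.apply(str.lower)
    apply_failed = False
except TypeError:
    apply_failed = True
```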


In [47]:
it['gender'] = it['jleagueLower'].map(genders)
it

Unnamed: 0,jleague,igs,tweets,gender,jleagueLower
0,billy batson,7,6,0,billy batson
1,barbara GORDON,3,8,1,barbara gordon
2,barbara gordon,9,5,1,barbara gordon
3,john constantiNe,4,6,0,john constantine
4,dinah lance,7,7,1,dinah lance
5,selina kyle,4,3,1,selina kyle
6,diana prince,6,9,1,diana prince
7,selina kyle,3,11,1,selina kyle
8,arthur curry,8,10,0,arthur curry
9,Selina Kyle,6,14,1,selina kyle


In [48]:
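def gen_conv(name):
    # Look up the coded gender for a name, ignoring case
    gen = genders[name.lower()]
    # Translate the code using the legend: '0' = 'm', '1' = 'f'
    if gen == '0':
        return 'm'
    elif gen == '1':
        return 'f'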

In [49]:
it['gender'] = it['jleague'].apply(gen_conv)
it

Unnamed: 0,jleague,igs,tweets,gender,jleagueLower
0,billy batson,7,6,m,billy batson
1,barbara GORDON,3,8,f,barbara gordon
2,barbara gordon,9,5,f,barbara gordon
3,john constantiNe,4,6,m,john constantine
4,dinah lance,7,7,f,dinah lance
5,selina kyle,4,3,f,selina kyle
6,diana prince,6,9,f,diana prince
7,selina kyle,3,11,f,selina kyle
8,arthur curry,8,10,m,arthur curry
9,Selina Kyle,6,14,f,selina kyle
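
The same conversion can also be sketched without a custom function by chaining two `map()` calls. The `legend` dictionary and the shortened `genders` dictionary below are stand-ins for this illustration, and `kara danvers` is a made-up name included to show what happens to a name that is not in the dictionary.

```python
import pandas as pd

genders = {'selina kyle': '1', 'bruce wayne': '0', 'barbara gordon': '1'}
legend = {'0': 'm', '1': 'f'}

# 'kara danvers' is a made-up name that is absent from genders
names = pd.Series(['Selina Kyle', 'barbara GORDON', 'bruce wayne', 'kara danvers'])

# First map name -> code, then code -> letter; unknown names become NaN
letters = names.str.lower().map(genders).map(legend)
print(letters)
```

One difference from `gen_conv()`: an unknown name yields a missing value here, whereas the dictionary lookup in `gen_conv()` raises a `KeyError`.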


In [50]:
# Using .replace()
# You can also replace certain values wholesale with the replace() method.
# It returns a new Series; the gender column itself is left unchanged.
it.gender.replace('f', 'Female')

0          m
1     Female
2     Female
3          m
4     Female
5     Female
6     Female
7     Female
8          m
9     Female
10         m
11         m
12         m
13         m
14         m
15    Female
16    Female
17         m
18    Female
19         m
20         m
21    Female
22         m
23    Female
24         m
25         m
26    Female
27         m
28    Female
29         m
30         m
31    Female
32    Female
33         m
34         m
35         m
36         m
37    Female
38    Female
39    Female
40         m
41         m
42         m
43         m
44         m
45         m
46         m
47    Female
48         m
49         m
Name: gender, dtype: object

In [51]:
it.gender.replace(['f', 'm'], ['Female', 'Male'])

0       Male
1     Female
2     Female
3       Male
4     Female
5     Female
6     Female
7     Female
8       Male
9     Female
10      Male
11      Male
12      Male
13      Male
14      Male
15    Female
16    Female
17      Male
18    Female
19      Male
20      Male
21    Female
22      Male
23    Female
24      Male
25      Male
26    Female
27      Male
28    Female
29      Male
30      Male
31    Female
32    Female
33      Male
34      Male
35      Male
36      Male
37    Female
38    Female
39    Female
40      Male
41      Male
42      Male
43      Male
44      Male
45      Male
46      Male
47    Female
48      Male
49      Male
Name: gender, dtype: object
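
`replace()` also accepts a dictionary that pairs each old value with its new one. A minimal sketch on a small stand-in Series (`gender`, not the full column above):

```python
import pandas as pd

gender = pd.Series(['m', 'f', 'f', 'm'])

# A dictionary pairs each old value with its replacement
spelled_out = gender.replace({'f': 'Female', 'm': 'Male'})

# replace() returned a new Series; assign it back to keep the change
print(gender.tolist())
print(spelled_out.tolist())
```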

## Bins
---

In [52]:
msgs = it.tweets
bins = [2, 5, 9, 15]

categories = pd.cut(msgs, bins)

categories

0      (5, 9]
1      (5, 9]
2      (2, 5]
3      (5, 9]
4      (5, 9]
5      (2, 5]
6      (5, 9]
7     (9, 15]
8     (9, 15]
9     (9, 15]
10     (5, 9]
11     (5, 9]
12     (5, 9]
13     (5, 9]
14    (9, 15]
15     (2, 5]
16     (2, 5]
17    (9, 15]
18     (2, 5]
19     (2, 5]
20     (2, 5]
21     (5, 9]
22    (9, 15]
23    (9, 15]
24     (5, 9]
25     (5, 9]
26    (9, 15]
27    (9, 15]
28     (5, 9]
29     (5, 9]
30     (2, 5]
31     (5, 9]
32     (5, 9]
33     (5, 9]
34    (9, 15]
35    (9, 15]
36     (5, 9]
37     (2, 5]
38     (2, 5]
39    (9, 15]
40    (9, 15]
41    (9, 15]
42     (2, 5]
43    (9, 15]
44     (5, 9]
45     (2, 5]
46     (5, 9]
47     (2, 5]
48    (9, 15]
49    (9, 15]
Name: tweets, dtype: category
Categories (3, interval[int64]): [(2, 5] < (5, 9] < (9, 15]]

In [53]:
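# Interval notation: '(' or ')' marks an open (exclusive) edge
#                    '[' or ']' marks a closed (inclusive) edge
# 2 < x <= 5   is written   (2, 5]
#
# pd.cut() closes the right edge by default (right=True);
# pass right=False to close the left edge instead, e.g. [2, 5)

categories.value_counts()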

(5, 9]     20
(9, 15]    17
(2, 5]     13
Name: tweets, dtype: int64
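
To make the edge handling concrete, here is a small sketch that bins three values sitting exactly on the bin edges. The Series `msgs` is a made-up stand-in for the tweets column, and `right` and `include_lowest` are the `pd.cut()` parameters being compared.

```python
import pandas as pd

msgs = pd.Series([2, 5, 9])
bins = [2, 5, 9, 15]

# Default right=True: intervals are (2, 5], (5, 9], (9, 15], so 2 falls outside
default = pd.cut(msgs, bins)

# right=False: intervals are [2, 5), [5, 9), [9, 15)
left_closed = pd.cut(msgs, bins, right=False)

# include_lowest=True keeps right-closed bins but also admits the lowest edge
with_lowest = pd.cut(msgs, bins, include_lowest=True)

print(pd.DataFrame({'default': default, 'left_closed': left_closed,
                    'with_lowest': with_lowest}))
```

A value that falls outside every bin, like the 2 in the default case, is reported as missing instead of raising an error.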

In [54]:
labels = ['few', 'medium', "aren't there bad guys to catch"]
it['workload'] = pd.cut(it.tweets, bins, labels=labels)
it

Unnamed: 0,jleague,igs,tweets,gender,jleagueLower,workload
0,billy batson,7,6,m,billy batson,medium
1,barbara GORDON,3,8,f,barbara gordon,medium
2,barbara gordon,9,5,f,barbara gordon,few
3,john constantiNe,4,6,m,john constantine,medium
4,dinah lance,7,7,f,dinah lance,medium
5,selina kyle,4,3,f,selina kyle,few
6,diana prince,6,9,f,diana prince,medium
7,selina kyle,3,11,f,selina kyle,aren't there bad guys to catch
8,arthur curry,8,10,m,arthur curry,aren't there bad guys to catch
9,Selina Kyle,6,14,f,selina kyle,aren't there bad guys to catch


# Experience Points!
---

In your **text editor** create a simple script called:

```bash
my_bin_01.py
```

Execute your script in the **IPython interpreter** using the command:

```bash
run my_bin_01.py
```

Your script should do the following:

* Bring out your merged DataFrame from the last exercise
* Bin the payload column in increments of 100_000 up to AND INCLUDING 1_000_000, using spelled-out labels, e.g.:
    * `One hundred thousand`
    * `Two hundred thousand`
    * ...
    * `Nine hundred thousand`
    * `One million`
* Store the binned data in a new column called `bins`
* Create a pivot table using:
    * `lat` column as index
    * `long` column as column names
    * `bins` column as values

When you complete this exercise, please put your green post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../images/green_sticky.300px.png' width='200' style='float:left'>