# Welcome to the Dark Art of Coding:
## Introduction to Python for Data Analysis
Pandas: Series & DataFrames

<img src='../universal_images/dark_art_logo.600px.png' width='300' style="float:right">

# Objectives
---

In this session, students should expect to:

* understand the purpose and application of `pandas` to data analysis problems
* understand how to create and use a `Series`
* understand how to create and use a `DataFrame`
* explore various simple examples of pandas usage


# `pandas` basics
---

**`pandas`** is one of the premier data analysis libraries in the Python ecosystem. It offers high-performance, easy-to-use data structures and data analysis tools enabling you to carry out your entire data analysis workflow.

`pandas` is used for:

* data analysis/science
* financial analysis
* data manipulation
* data cleansing
* data transformation

`pandas` has tools to read and write data to and from multiple data formats.

It also includes tools that simplify:

* grouping data
* applying transformations to columns, rows and individual cells
* working with dates and times

# List vs. Dict vs. Series vs. DataFrame
---

To better understand the `pandas` datatypes, it helps to review datatypes you are already familiar with: `list` and `dict`. In this review, we will compare and contrast techniques to access or view data in `list`s and `dict`s.

## list 
```python                               
mylist = ['A', 'B', 'C']            
```

**indexable**: 

`mylist[0]` indexable by integer            

**sliceable**: 

`mylist[0:2]` sliceable by integer
 

## dict
```python
mydict = {'alpha': 1,
          'beta': 2,
          'gamma': 2}
```

**indexable**: 

`mydict['alpha']` indexable by key


## Series
```
myseries = Series(['bruce', 'selina', 'kara', 'clark'], index=[0, 1, 2, 'three'])

          column
rows
0         'bruce'
1         'selina'
2         'kara'
'three'   'clark'
```

**indexable**: 

`myseries[0]` indexable by integer

`myseries['three']` indexable by row name

**sliceable**:

`myseries[0:3]` sliceable by integer                 

## DataFrame:
```
mydataframe = DataFrame(lots of data... we'll show you how to make a dataframe shortly)

        col1      col2        col3      age
rows
0       'bruce'   'wayne'     'M'       42
1       'selina'  'kyle'      'F'       34
'two'   'kara'    'zor-el'    'F'       27
3       'clark'   'kent'      'M'       35
```

**indexable**

`mydataframe['col1']` indexable by column name (can also be indexed by row)

`mydataframe[['col1', 'age']]` indexable by multiple column/row indicators
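The four containers compared above can be built and accessed side by side. This is a sketch under a few assumptions: the `DataFrame` is constructed from a dict of columns (one of several possible ways), and the explicit `.loc` (by label) and `.iloc` (by position) accessors are used so that the intent of each lookup is unambiguous.

```python
import pandas as pd

mylist = ['A', 'B', 'C']
mydict = {'alpha': 1, 'beta': 2, 'gamma': 2}

myseries = pd.Series(['bruce', 'selina', 'kara', 'clark'],
                     index=[0, 1, 2, 'three'])

mydataframe = pd.DataFrame(
    {'col1': ['bruce', 'selina', 'kara', 'clark'],
     'col2': ['wayne', 'kyle', 'zor-el', 'kent'],
     'col3': ['M', 'F', 'F', 'M'],
     'age': [42, 34, 27, 35]},
    index=[0, 1, 'two', 3])

print(mylist[0], mylist[0:2])          # list: integer index and slice
print(mydict['alpha'])                 # dict: key lookup
print(myseries.loc['three'])           # Series: lookup by row label
print(myseries.iloc[0:3].tolist())     # Series: slice by position
print(mydataframe['col1'].tolist())    # DataFrame: one column by name
print(mydataframe.loc['two', 'age'])   # DataFrame: row label plus column name
```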

# Series
---

In [1]:
# Let's start by making a simple Series.
# It is customary to import pandas by the alias: pd

import pandas as pd
from pandas import Series

s = Series([33, 37, 27, 42])

# pandas will assign an index automatically starting at "0"

s

0    33
1    37
2    27
3    42
dtype: int64

In [2]:
# We can see that the object is a Series object
type(s)

pandas.core.series.Series

In [3]:
s[0]

33

So, what's the difference between a `Series` and a `list`?

In [4]:
l = [33, 37, 27, 42]

In [None]:
# Let's use tab completion to explore the difference
# l.<tab complete>

l.

# There are 11 methods/attributes associated with lists...

In [5]:
# s.<tab complete>

s.
# There are 226 methods/attributes associated with this Series...
# len(list(attr for attr in dir(s) if not attr.startswith('_')))

34.75
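If tab completion is not available, the same comparison can be made programmatically. The sketch below counts the public names on each object; `public_names` is a helper made up for this example, and the count for a `Series` will depend on the installed pandas version.

```python
import pandas as pd

l = [33, 37, 27, 42]
s = pd.Series(l)

def public_names(obj):
    """Return the attribute names that do not start with an underscore."""
    return [name for name in dir(obj) if not name.startswith('_')]

list_names = public_names(l)
series_names = public_names(s)

print(len(list_names))      # 11 for a built-in list
print(len(series_names))    # far more; the exact count varies by pandas version
```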

In [6]:
# Series objects can be assigned a name 
# The index (0, 1, 2...) can also be assigned directly/overwritten.

s.name = 'Justice League ages'
s.index = ['bruce', 'selina', 'kara', 'clark']

s

bruce     33
selina    37
kara      27
clark     42
Name: Justice League ages, dtype: int64

In [7]:
s['kara']

27

In [10]:
# NOTE: The Series constructor allows you to assign attributes
#     such as the index directly.

s1 = Series([37, 36, 10, 36],
            index=['hal', 'victor', 'diana', 'billy'],
            name='More Justice League ages')
s1

# Generally, any ordered iterable can be used to produce the inputs
#     to a Series. We used a list here.
#     range(), generators, arrays, are also acceptable inputs.

hal       37
victor    36
diana     10
billy     36
Name: More Justice League ages, dtype: int64

In [11]:
# Accessing a row directly uses brackets and the 
#     name of the row.

s1['billy']

36

In [12]:
# Accessing multiple rows uses the names of 
#     the rows embedded in an iterable (i.e. a list)

s1[['billy', 'hal', 'victor']]

billy     36
hal       37
victor    36
Name: More Justice League ages, dtype: int64

In [13]:
# Accessing multiple rows may also use slice notation.
#     slice notation can often be used with both integer indexes and
#     string indexes.
# Here we use string indexes... 

s1['hal':'diana']

hal       37
victor    36
diana     10
Name: More Justice League ages, dtype: int64

In [14]:
# slice notation using integers still works even if we have 
#     applied a string-based index.
# Here we use integer indexes... but NOTE: the difference in behavior
# * string indexing includes the last element.
# * integer indexing behaves more like Python and goes up to but does NOT include
#   the last element

s1[0:2]

hal       37
victor    36
Name: More Justice League ages, dtype: int64
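The difference between label slicing and integer slicing is easier to keep straight when the accessor states which one is meant. A minimal sketch, assuming you are happy to use `.loc` for labels and `.iloc` for positions rather than bare brackets:

```python
import pandas as pd

s1 = pd.Series([37, 36, 10, 36],
               index=['hal', 'victor', 'diana', 'billy'],
               name='More Justice League ages')

by_label = s1.loc['hal':'diana']    # label slice: 'diana' IS included
by_position = s1.iloc[0:2]          # position slice: position 2 is NOT included

print(by_label.index.tolist())      # ['hal', 'victor', 'diana']
print(by_position.index.tolist())   # ['hal', 'victor']
```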

In [15]:
# We can also index the Series using a list of items:
# NOTE: the order of the result matches the order of our
#     selection

s1[['billy', 'hal', 'diana']]

billy    36
hal      37
diana    10
Name: More Justice League ages, dtype: int64

In [16]:
# Similarly, assignment of a value to a specific row
#     uses bracket indexing and behaves similarly to
#     a list OR dict

s1['diana'] = 32
s1

hal       37
victor    36
diana     32
billy     36
Name: More Justice League ages, dtype: int64

It is possible to check whether a single element in a `Series` matches a specific value:


In [17]:
s1['billy'] == 35

False

More holistically, it is possible to see which elements in a `Series` match a specific value.

In [18]:
s1 == 32

hal       False
victor    False
diana      True
billy     False
Name: More Justice League ages, dtype: bool

In [19]:
result = s1 == 36

print(type(result), result, sep='\n\n')

<class 'pandas.core.series.Series'>

hal       False
victor     True
diana     False
billy      True
Name: More Justice League ages, dtype: bool


In [20]:
# Rows can be filtered using any such sequence of True and False

s1[result]

victor    36
billy     36
Name: More Justice League ages, dtype: int64

In [21]:
# We can also use any of the standard comparison operators
#     ==
#     <= 
#     >=
#     <
#     >
#
# In addition, there is no need to save the True/False series... you can do the evaluation
#     inside the indexing brackets

s1[s1 >= 33]

hal       37
victor    36
billy     36
Name: More Justice League ages, dtype: int64
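Several comparisons can be combined into one filter. This sketch assumes the element-wise operators `&` (and), `|` (or) and `~` (not); each comparison needs its own parentheses because these operators bind more tightly than `>=` or `<`.

```python
import pandas as pd

s1 = pd.Series([37, 36, 32, 36],
               index=['hal', 'victor', 'diana', 'billy'],
               name='More Justice League ages')

in_range = s1[(s1 >= 33) & (s1 < 37)]    # both conditions must hold
extremes = s1[(s1 < 33) | (s1 > 36)]     # either condition may hold
not_36 = s1[~(s1 == 36)]                 # invert a condition

print(in_range.index.tolist())   # ['victor', 'billy']
print(extremes.index.tolist())   # ['hal', 'diana']
print(not_36.index.tolist())     # ['hal', 'diana']
```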

# Experience Points!
---

In your **text editor** create a simple script called:

```bash
my_series_01.py
```

Create a pandas Series called `restaurant_ratings` according to the following guidelines:

* starting at the top, each row should contain one number from 1 to 5 (inclusive)
* give the series a `.name` called ratings
* give the series a `.index` with the names of five restaurants

Execute your script in **Jupyter** using the command:

```bash
run my_series_01.py
```


From Jupyter, explore your `Series` object (`restaurant_ratings`) by performing our typical explorations:
* `type()`
* `.<tab complete>`

Also look at the attributes you have added: 

* `.name`
* `.index`

Lastly extract particular records from your `Series`:

* Choose a record from the `Series` using the name of one of your restaurants
* Choose three records from the `Series` using a list of names of restaurants



When you complete this exercise, please put your green post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

In [None]:
run soln_series_01.py

# Experience Points!
---

In your **text editor** create a simple script called:

```bash
my_series_02.py
```

Create a pandas `Series` called `bacteria_lengths` according to the following guidelines:

* starting at the top, each row should contain one number from 1 to 5000 (inclusive) incrementing by 100
* give the `Series` a `.name` called lengths
* NOTE: do not worry about a `.index` for this `Series`. Simply use the default indexing that `Series` provide.

Execute your script from the **terminal/command line** using the command:

```bash
ipython -i my_series_02.py
```


From the IPython Interpreter, explore your Series object by performing our typical explorations:
* `type()`
* `.<tab complete>`

Also look at the attributes you have added or that were created by default: 

* `.name`
* `.index`

Lastly extract particular records from your Series:

* Choose a record from the Series at index 23
* Choose three records from the Series at index 23-25

When you complete this exercise, please put your green post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

In [27]:
run soln_series_02.py

In [28]:
bacteria_lengths[23]

2301

In [29]:
bacteria_lengths[23:26]

23    2301
24    2401
25    2501
Name: lengths, dtype: int64
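The contents of `soln_series_02.py` are not shown here, so the following is only one possible sketch that is consistent with the outputs above. It assumes `range(1, 5001, 100)` as the source of the values.

```python
import pandas as pd

# Values 1, 101, 201, ... 4901 with the default 0-based index
bacteria_lengths = pd.Series(range(1, 5001, 100), name='lengths')

print(len(bacteria_lengths))                    # 50
print(bacteria_lengths.loc[23])                 # 2301
print(bacteria_lengths.iloc[23:26].tolist())    # [2301, 2401, 2501]
```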

In [30]:
# Much like numpy, pandas Series (and DataFrames)
#     offer vector mathematics whereby you can add to
#     or multiply against all rows or cells
#     WITHOUT using a for loop.

s1 * 2

hal       74
victor    72
diana     64
billy     72
Name: More Justice League ages, dtype: int64

In [31]:
s1

hal       37
victor    36
diana     32
billy     36
Name: More Justice League ages, dtype: int64

In [32]:
# Knowing that the selection of specific elements from a Series simply returns
#     a Series, we can also perform vector multiplication against a 
#     selection

s1[['diana', 'billy']] * 20

diana    640
billy    720
Name: More Justice League ages, dtype: int64

In [33]:
s1

hal       37
victor    36
diana     32
billy     36
Name: More Justice League ages, dtype: int64

In [34]:
# Testing for inclusion with `in` checks the index labels
#     (much like the keys of a dict), not the values

'diana' in s1

True

In [35]:
# A test for inclusion in a list, for comparison

42 in [1, 2, 3, 4]

False

In [36]:
'lex' in s1

False
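One detail worth seeing in action: `in` looks at the index labels of a `Series`, not at its values. The sketch below contrasts the two, using `.values` and `.isin()` as two ways to search the values instead.

```python
import pandas as pd

s1 = pd.Series([37, 36, 32, 36],
               index=['hal', 'victor', 'diana', 'billy'])

print('diana' in s1)          # True: 'diana' is an index label
print(36 in s1)               # False: 36 is a value, not a label
print(36 in s1.values)        # True: search the underlying values
print(s1.isin([36]).any())    # True: another way to search the values
```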

In [37]:
# Series can be built in many ways, including using a dictionary
#     to populate the row labels and the row contents

names = {'bruce wayne': 'bwayne@jleague.org',
         'hal jordan': 'hjordan@jleague.org',
         'clark kent': 'ckent@jleague.org',
         'barry allen': 'ballen@jleague.org',
         'diana prince': 'dprince@jleague.org',
         'arthur curry': 'acurry@jleague.org',
         'billy batson': 'bbatson@jleague.org',
         'john jones': 'jjones@jleague.org',
         'victor stone': 'vstone@jleague.org',
         'dick grayson': 'dgrayson@jleague.org',
         'ray palmer': 'rpalmer@jleague.org',
         'dinah lance': 'dlance@jleague.org',
         'kara zor-el': 'kzor-el@jleague.org',
         'john constantine': 'jconstantine@jleague.org',
         'barbara gordon': 'bgordon@jleague.org',
         'kyle rayner': 'krayner@jleague.org',
         'selina kyle': 'skyle@jleague.org',
         'wally west': 'wwest@jleague.org',
         }

emails = Series(names)
# emails.index
# emails.values

In [38]:
emails

bruce wayne               bwayne@jleague.org
hal jordan               hjordan@jleague.org
clark kent                 ckent@jleague.org
barry allen               ballen@jleague.org
diana prince             dprince@jleague.org
arthur curry              acurry@jleague.org
billy batson             bbatson@jleague.org
john jones                jjones@jleague.org
victor stone              vstone@jleague.org
dick grayson            dgrayson@jleague.org
ray palmer               rpalmer@jleague.org
dinah lance               dlance@jleague.org
kara zor-el              kzor-el@jleague.org
john constantine    jconstantine@jleague.org
barbara gordon           bgordon@jleague.org
kyle rayner              krayner@jleague.org
selina kyle                skyle@jleague.org
wally west                 wwest@jleague.org
dtype: object

In [39]:
emails[    ['barry allen',  'selina kyle']     ]

barry allen    ballen@jleague.org
selina kyle     skyle@jleague.org
dtype: object

# Analyzing data
---

In [40]:
s1 = Series(range(10, 16), index=['a', 'b', 'c', 'd', 'e', 'f'])
s2 = Series(range(16, 22), index=['a', 'b', 'c', 'x', 'y', 'z'])

# s1
# s2

In [41]:
s2

a    16
b    17
c    18
x    19
y    20
z    21
dtype: int64

In [42]:
s1

a    10
b    11
c    12
d    13
e    14
f    15
dtype: int64

In [43]:
s3 = s1 + s2
s3

# type(s3)
# pd.isnull(s3)
# s3.isnull()
# s3.<tab>

a    26.0
b    28.0
c    30.0
d     NaN
e     NaN
f     NaN
x     NaN
y     NaN
z     NaN
dtype: float64
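The `NaN` values appear because `+` aligns the two `Series` on their labels, and a label present on only one side has nothing to add to. If you would rather treat a missing label as zero, one option is the `.add()` method with `fill_value`, sketched here:

```python
import pandas as pd

s1 = pd.Series(range(10, 16), index=['a', 'b', 'c', 'd', 'e', 'f'])
s2 = pd.Series(range(16, 22), index=['a', 'b', 'c', 'x', 'y', 'z'])

# Treat a label missing from one Series as 0 instead of producing NaN
filled = s1.add(s2, fill_value=0)

print(filled)
```

Whether zero is a sensible stand-in depends on what the data means, so this is a choice to make deliberately rather than by default.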

In [44]:
pd.concat([s1, s2], ignore_index=True)

0     10
1     11
2     12
3     13
4     14
5     15
6     16
7     17
8     18
9     19
10    20
11    21
dtype: int64

In [None]:
# How do I learn more?
# s3.<method_name>?        # just ask by typing the method name (sans parentheses) and 
#                          # adding a question mark to see the builtin help docs
# 
# s3.value_counts?
# s3.value_counts(dropna=False)

In [45]:
s3.value_counts?

Signature:
s3.value_counts(
    normalize: 'bool' = False,
    sort: 'bool' = True,
    ascending: 'bool' = False,
    bins=None,
    dropna: 'bool' = True,
)
Docstring:
Return a Series containing counts of unique values.

The resulting object will be in descending order so that the
first element is the most frequently-occurring element.
Excludes NA values by default.

Parameters
----------
normalize : bool, default False
    If True then the object returned will contain th

In [46]:
s3.value_counts(dropna=False, ascending=True)

26.0    1
28.0    1
30.0    1
NaN     6
dtype: int64

In [47]:
s3.dropna()
# s3

a    26.0
b    28.0
c    30.0
dtype: float64
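Note that `dropna()` returns a new `Series` and leaves `s3` itself untouched. The sketch below confirms that, and also shows counting the missing values with `isnull()` and replacing them with `fillna()` (the replacement value `0` is an arbitrary choice for the example).

```python
import pandas as pd

s1 = pd.Series(range(10, 16), index=['a', 'b', 'c', 'd', 'e', 'f'])
s2 = pd.Series(range(16, 22), index=['a', 'b', 'c', 'x', 'y', 'z'])
s3 = s1 + s2

missing = s3.isnull()          # boolean Series: True where the value is NaN
cleaned = s3.dropna()          # new Series without the NaN rows
zeroed = s3.fillna(0)          # new Series with NaN replaced by 0

print(missing.sum())           # 6
print(len(cleaned), len(s3))   # 3 9
print(zeroed.tolist())
```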

In [48]:
s4 = Series([42, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 5, 6])

# s4.unique()
# s4.value_counts()
# s4.max()
# s4 + 2

In [49]:
s4.value_counts()

3     6
1     3
2     2
42    1
4     1
5     1
6     1
dtype: int64

In [50]:
def transmogrifier(x):
    '''hat tip to Calvin and Hobbes for introducing me to this 
    truly fantastic word. thanks, bill watterson.

    "transform, especially in a surprising or magical manner."
    '''
    new_val = '- ' + str(x ** 3) + ' -'
    return new_val

In [51]:
s4.apply(transmogrifier)

0     - 74088 -
1         - 1 -
2         - 1 -
3         - 1 -
4         - 8 -
5         - 8 -
6        - 27 -
7        - 27 -
8        - 27 -
9        - 27 -
10       - 27 -
11       - 27 -
12       - 64 -
13      - 125 -
14      - 216 -
dtype: object
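`.apply()` calls the function once per element. For a transformation this simple, the same result can be built from vectorized operations. The following sketch compares the two approaches; it is an alternative for illustration, not a replacement for `.apply()`, which remains the more general tool.

```python
import pandas as pd

s4 = pd.Series([42, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 5, 6])

def transmogrifier(x):
    """Cube the value and wrap it in dashes."""
    return '- ' + str(x ** 3) + ' -'

applied = s4.apply(transmogrifier)

# Same steps, expressed on the whole Series at once
vectorized = '- ' + (s4 ** 3).astype(str) + ' -'

print(applied.tolist() == vectorized.tolist())   # True
```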

# DataFrames
---

In [52]:
from pandas import DataFrame

In [53]:
# Making a DataFrame # 1
# Using a dictionary:

data = {'hero': ['billy', 'billy', 'billy', 'selina', 'selina'],
        'date': ['Jan 10', 'Jan 11', 'Jan 12', 'Jan 10', 'Jan 11'],
        'emails': [111, 121, 93, 211, 210]}

df = DataFrame(data)
df

     hero    date  emails
0   billy  Jan 10     111
1   billy  Jan 11     121
2   billy  Jan 12      93
3  selina  Jan 10     211
4  selina  Jan 11     210


In [54]:
df = DataFrame(data, columns=['date', 'hero', 'emails'])
df

     date    hero  emails
0  Jan 10   billy     111
1  Jan 11   billy     121
2  Jan 12   billy      93
3  Jan 10  selina     211
4  Jan 11  selina     210


In [55]:
df = DataFrame(data, columns=['date', 'hero', 'emails', 'instagrams'],
               index=[1, 2, 3, 4, 5])
df

# df
# df.columns

     date    hero  emails  instagrams
1  Jan 10   billy     111         NaN
2  Jan 11   billy     121         NaN
3  Jan 12   billy      93         NaN
4  Jan 10  selina     211         NaN
5  Jan 11  selina     210         NaN


In [56]:
df['instagrams'] = 42

In [57]:
df

     date    hero  emails  instagrams
1  Jan 10   billy     111          42
2  Jan 11   billy     121          42
3  Jan 12   billy      93          42
4  Jan 10  selina     211          42
5  Jan 11  selina     210          42


In [58]:
df[['date', 'emails']]

     date  emails
1  Jan 10     111
2  Jan 11     121
3  Jan 12      93
4  Jan 10     211
5  Jan 11     210


In [59]:
# df['hero']
# df.hero

df.loc[1]

date          Jan 10
hero           billy
emails           111
instagrams        42
Name: 1, dtype: object

In [60]:
df.loc[1:2]

     date   hero  emails  instagrams
1  Jan 10  billy     111          42
2  Jan 11  billy     121          42


In [61]:
df.loc[1:4:2]

     date   hero  emails  instagrams
1  Jan 10  billy     111          42
3  Jan 12  billy      93          42


In [62]:
from pandas import Series

df.instagrams = 50

In [63]:
df.emails

1    111
2    121
3     93
4    211
5    210
Name: emails, dtype: int64

In [64]:
ins = Series([10, 20, 30], index=[0, 2, 4])
ins

0    10
2    20
4    30
dtype: int64

In [65]:
df['instagrams'] = ins
df

     date    hero  emails  instagrams
1  Jan 10   billy     111         NaN
2  Jan 11   billy     121        20.0
3  Jan 12   billy      93         NaN
4  Jan 10  selina     211        30.0
5  Jan 11  selina     210         NaN
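The `NaN` values above come from label alignment: when a `Series` is assigned to a column, pandas matches its index labels against the `DataFrame` index, and only labels `2` and `4` exist in both. The sketch below isolates that behavior on a smaller frame and contrasts it with assigning a plain list, which is matched by position instead. The column names `aligned` and `by_position` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame({'emails': [111, 121, 93, 211, 210]},
                  index=[1, 2, 3, 4, 5])
ins = pd.Series([10, 20, 30], index=[0, 2, 4])

# Assigning a Series aligns on index labels: only labels 2 and 4 match
df['aligned'] = ins

# Assigning a plain list ignores labels, but its length must match the rows
df['by_position'] = [10, 20, 30, 40, 50]

print(df)
```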


In [66]:
# If you want to add a new column, dataframes are completely
#     mutable: columns can be added at will.

df['overworked'] = df['emails'] >= 120
df

     date    hero  emails  instagrams  overworked
1  Jan 10   billy     111         NaN       False
2  Jan 11   billy     121        20.0        True
3  Jan 12   billy      93         NaN       False
4  Jan 10  selina     211        30.0        True
5  Jan 11  selina     210         NaN        True


In [67]:
df[    df['overworked'] == False       ]

     date   hero  emails  instagrams  overworked
1  Jan 10  billy     111         NaN       False
3  Jan 12  billy      93         NaN       False


In [68]:
# The same comparison can also be stored as a standalone boolean Series
#     instead of being added to the DataFrame as a column.

standalone_series = df['emails'] >= 120
standalone_series

1    False
2     True
3    False
4     True
5     True
Name: emails, dtype: bool

In [69]:
mask = df['instagrams'] != 20.0
mask

1     True
2    False
3     True
4     True
5     True
Name: instagrams, dtype: bool

In [70]:
df[mask]

     date    hero  emails  instagrams  overworked
1  Jan 10   billy     111         NaN       False
3  Jan 12   billy      93         NaN       False
4  Jan 10  selina     211        30.0        True
5  Jan 11  selina     210         NaN        True


In [72]:
df[df['date'] != 'Jan 10']

     date    hero  emails  instagrams  overworked
2  Jan 11   billy     121        20.0        True
3  Jan 12   billy      93         NaN       False
5  Jan 11  selina     210         NaN        True
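Boolean filters also work with `.loc`, which lets you choose rows and columns in one step. This sketch rebuilds a trimmed copy of the frame (without the `instagrams` and `overworked` columns) and additionally uses `.isin()` to match against several values at once.

```python
import pandas as pd

df = pd.DataFrame({'date': ['Jan 10', 'Jan 11', 'Jan 12', 'Jan 10', 'Jan 11'],
                   'hero': ['billy', 'billy', 'billy', 'selina', 'selina'],
                   'emails': [111, 121, 93, 211, 210]},
                  index=[1, 2, 3, 4, 5])

# Rows where emails >= 120, keeping only two of the columns
busy = df.loc[df['emails'] >= 120, ['hero', 'emails']]

# Rows whose date is any of the listed values
later = df[df['date'].isin(['Jan 11', 'Jan 12'])]

print(busy)
print(later)
```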


In [73]:
# Making a DataFrame # 2
# Using a dictionary with nested dictionaries...

data = {'billy': {'Jan 10': 202, 'Jan 11': 220, 'Jan 12': 198},
        'selina': {'Jan 09': 246, 'Jan 10': 235, 'Jan 11': 243}}

In [74]:
df2 = DataFrame(data)
df2

        billy  selina
Jan 10  202.0   235.0
Jan 11  220.0   243.0
Jan 12  198.0     NaN
Jan 09    NaN   246.0


In [75]:
# df2.T
dft = df2.T
dft

        Jan 10  Jan 11  Jan 12  Jan 09
billy    202.0   220.0   198.0     NaN
selina   235.0   243.0     NaN   246.0


In [76]:
dft.columns.name = 'date'
dft.index.name = 'hero'

In [77]:
dft

date    Jan 10  Jan 11  Jan 12  Jan 09
hero
billy    202.0   220.0   198.0     NaN
selina   235.0   243.0     NaN   246.0


In [78]:
# using indexes
nums = Series(range(10, 16),
              index=['t', 'u', 'v', 'x', 'y', 'z'])
nums

t    10
u    11
v    12
x    13
y    14
z    15
dtype: int64

In [79]:
i = nums.index
print(type(i), i)

<class 'pandas.core.indexes.base.Index'> Index(['t', 'u', 'v', 'x', 'y', 'z'], dtype='object')


In [80]:
i[::3]

# i[2:4]
# i[::2]
# i[::3]
# i[4]

Index(['t', 'x'], dtype='object')

In [82]:
logs = pd.read_csv('../universal_datasets/log_file_1000.csv', names=['name',
                                               'email',
                                               'fm_ip',
                                               'to_ip',
                                               'date_time',
                                               'lat',
                                               'long',
                                               'payload_size'])

In [83]:
logs

Unnamed: 0,name,email,fm_ip,to_ip,date_time,lat,long,payload_size
0,barry allen,ballen@jleague.org,155.130.121.215,75.122.133.241,2016-02-08T21:44:41,49.83160,8.01485,764272
1,arthur curry,acurry@jleague.org,106.152.115.161,106.152.114.248,2016-02-08T21:45:37,45.10327,11.68293,249206
2,john jones,jjones@jleague.org,60.15.193.250,155.130.121.215,2016-02-08T21:46:53,47.11673,10.35874,856820
3,wally west,wwest@jleague.org,190.214.22.201,190.214.22.116,2016-02-07T21:47:12,46.75616,11.47886,593774
4,arthur curry,acurry@jleague.org,60.15.193.74,60.15.193.95,2016-02-07T21:48:04,48.59134,12.30683,171910
...,...,...,...,...,...,...,...,...
995,clark kent,ckent@jleague.org,106.152.114.248,60.15.193.249,2015-09-05T11:29:55,48.90180,12.26173,572009
996,billy batson,bbatson@jleague.org,102.86.56.213,60.15.193.95,2015-09-05T11:31:10,46.79957,11.39590,520973
997,bruce wayne,bwayne@jleague.org,60.15.193.95,75.122.133.10,2015-09-05T11:31:47,47.22726,8.13069,193138
998,ray palmer,rpalmer@jleague.org,155.130.121.215,169.28.228.152,2015-09-04T11:32:18,48.28877,9.67684,575482
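The course CSV file may not be at hand when you read this, so here is a self-contained sketch of the same `names=` idea. It reads from an in-memory `io.StringIO` buffer instead of a file, and uses only three of the columns to keep the example short. Because `names` is supplied, pandas treats the first line as data rather than as a header row.

```python
import io
import pandas as pd

# Two rows standing in for a header-less log file (trimmed to three columns)
raw = io.StringIO(
    'barry allen,ballen@jleague.org,764272\n'
    'arthur curry,acurry@jleague.org,249206\n'
)

sample = pd.read_csv(raw, names=['name', 'email', 'payload_size'])

print(sample)
print(sample.dtypes)
```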


In [84]:
pd.read_csv?

Signature:
pd.read_csv(
    filepath_or_buffer: 'FilePathOrBuffer',
    sep=<no_default>,
    delimiter=None,
    header='infer',
    names=<no_default>,
    index_col=None,
    usecols=None,
    squeeze=False,
    prefix=<no_default>,
    mangle_dupe_cols=True,

In [85]:
logs['fm_ip'].unique()

array(['155.130.121.215', '106.152.115.161', '60.15.193.250',
       '190.214.22.201', '60.15.193.74', '106.152.114.248',
       '60.15.193.95', '220.211.18.48', '220.211.18.31', '190.214.22.116',
       '155.130.121.6', '106.152.114.9', '102.86.56.199',
       '75.122.132.124', '190.214.22.94', '60.15.193.249',
       '155.130.120.114', '75.122.133.75', '220.211.18.12',
       '102.86.56.213', '169.28.228.153', '102.86.56.243',
       '102.86.56.203', '106.152.115.130', '190.214.22.59',
       '75.122.133.241', '75.122.133.10', '169.28.228.152',
       '155.130.121.22', '106.152.115.49'], dtype=object)

In [86]:
logs['name'].value_counts()

kyle rayner         63
kara zor-el         62
victor stone        62
clark kent          62
barbara gordon      62
diana prince        61
hal jordan          61
arthur curry        60
john jones          57
bruce wayne         55
ray palmer          54
dinah lance         53
billy batson        52
dick grayson        51
wally west          50
john constantine    46
selina kyle         46
barry allen         43
Name: name, dtype: int64

In [87]:
logs['name'].tail(13)

987        dick grayson
988        dick grayson
989        arthur curry
990        arthur curry
991      barbara gordon
992    john constantine
993        arthur curry
994         selina kyle
995          clark kent
996        billy batson
997         bruce wayne
998          ray palmer
999         bruce wayne
Name: name, dtype: object

In [88]:
g = logs.groupby(logs['fm_ip'])

type(g)


pandas.core.groupby.generic.DataFrameGroupBy

In [89]:
g.ngroups

30

In [90]:
g.first()

fm_ip,name,email,to_ip,date_time,lat,long,payload_size
102.86.56.199,diana prince,dprince@jleague.org,60.15.193.74,2016-02-05T21:56:07,47.64781,11.90056,622470
102.86.56.203,hal jordan,hjordan@jleague.org,102.86.56.213,2016-01-31T22:15:49,46.56542,11.1041,481866
102.86.56.213,hal jordan,hjordan@jleague.org,102.86.56.213,2016-01-31T22:13:56,45.45445,11.0413,625106
102.86.56.243,bruce wayne,bwayne@jleague.org,106.152.115.49,2016-01-31T22:15:03,45.67383,9.921,768775
106.152.114.248,ray palmer,rpalmer@jleague.org,102.86.56.203,2016-02-07T21:48:54,45.23082,10.90642,300389
106.152.114.9,barry allen,ballen@jleague.org,102.86.56.243,2016-02-05T21:54:59,48.03074,10.41633,87758
106.152.115.130,dick grayson,dgrayson@jleague.org,220.211.18.31,2016-01-29T22:26:49,48.17151,10.57354,426194
106.152.115.161,arthur curry,acurry@jleague.org,106.152.114.248,2016-02-08T21:45:37,45.10327,11.68293,249206
106.152.115.49,barbara gordon,bgordon@jleague.org,220.211.18.12,2016-01-05T00:54:23,46.25667,12.25382,671268
155.130.120.114,diana prince,dprince@jleague.org,106.152.114.9,2016-02-03T22:03:55,49.89778,8.25535,174355


In [91]:
for item in g:
    print(item)

('102.86.56.199',                name                 email          fm_ip            to_ip  \
15     diana prince   dprince@jleague.org  102.86.56.199     60.15.193.74   
22       wally west     wwest@jleague.org  102.86.56.199    106.152.114.9   
51     victor stone    vstone@jleague.org  102.86.56.199   169.28.228.153   
55     dick grayson  dgrayson@jleague.org  102.86.56.199    60.15.193.250   
66      barry allen    ballen@jleague.org  102.86.56.199   75.122.132.124   
91     arthur curry    acurry@jleague.org  102.86.56.199   155.130.121.22   
106    victor stone    vstone@jleague.org  102.86.56.199   169.28.228.152   
127    billy batson   bbatson@jleague.org  102.86.56.199    102.86.56.243   
143      ray palmer   rpalmer@jleague.org  102.86.56.199   169.28.228.152   
170  barbara gordon   bgordon@jleague.org  102.86.56.199   190.214.22.116   
189    billy batson   bbatson@jleague.org  102.86.56.199    106.152.114.9   
212      john jones    jjones@jleague.org  102.86.56.199  
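Each item produced by iterating over a groupby object is a pair: the group key and a `DataFrame` holding that group's rows. Unpacking the pair usually reads better than printing the raw tuple. The sketch below uses a tiny made-up log (`logs_sample`, with invented payload sizes) rather than the course dataset.

```python
import pandas as pd

# A tiny made-up log, invented for this illustration
logs_sample = pd.DataFrame({
    'name': ['diana prince', 'wally west', 'arthur curry'],
    'fm_ip': ['102.86.56.199', '102.86.56.199', '106.152.115.161'],
    'payload_size': [500, 100, 250],
})

g = logs_sample.groupby('fm_ip')

# Unpack each (key, DataFrame) pair while looping
for ip, group in g:
    print(ip, len(group), group['payload_size'].sum())

sizes = g.size()                      # number of rows per group
totals = g['payload_size'].sum()      # total payload per group
```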

In [92]:
type(g.get_group('106.152.115.161'))

pandas.core.frame.DataFrame

In [93]:
g.get_group('106.152.115.161').head(10)

Unnamed: 0,name,email,fm_ip,to_ip,date_time,lat,long,payload_size
1,arthur curry,acurry@jleague.org,106.152.115.161,106.152.114.248,2016-02-08T21:45:37,45.10327,11.68293,249206
124,john constantine,jconstantine@jleague.org,106.152.115.161,106.152.115.49,2016-01-18T23:26:50,46.51786,7.5144,197413
137,john jones,jjones@jleague.org,106.152.115.161,75.122.132.124,2016-01-17T23:39:15,47.59412,11.93051,141543
149,billy batson,bbatson@jleague.org,106.152.115.161,75.122.133.10,2016-01-14T23:48:04,48.42538,10.99307,395075
168,clark kent,ckent@jleague.org,106.152.115.161,106.152.115.130,2016-01-12T00:05:46,47.15144,8.52435,924131
173,barry allen,ballen@jleague.org,106.152.115.161,60.15.193.249,2016-01-12T00:09:58,45.83844,8.21253,621909
188,arthur curry,acurry@jleague.org,106.152.115.161,75.122.133.75,2016-01-11T00:22:31,48.3262,9.66367,15034
201,kara zor-el,kzor-el@jleague.org,106.152.115.161,75.122.133.241,2016-01-08T00:36:08,49.26437,12.09269,760026
210,billy batson,bbatson@jleague.org,106.152.115.161,190.214.22.94,2016-01-06T00:43:44,47.02258,7.75461,389068
259,billy batson,bbatson@jleague.org,106.152.115.161,75.122.132.124,2015-12-31T01:24:57,48.9545,9.75673,503714


In [99]:
def date_only(dt):
    day = dt.split('T')[0]
    return day

In [101]:
logs['date'] = logs['date_time'].apply(date_only)

In [102]:
logs.columns

Index(['name', 'email', 'fm_ip', 'to_ip', 'date_time', 'lat', 'long',
       'payload_size', 'date'],
      dtype='object')

In [103]:
logs.date

0      2016-02-08
1      2016-02-08
2      2016-02-08
3      2016-02-07
4      2016-02-07
          ...    
995    2015-09-05
996    2015-09-05
997    2015-09-05
998    2015-09-04
999    2015-09-04
Name: date, Length: 1000, dtype: object
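The `date_only` function works well with `.apply()`. For comparison, here is a sketch of two alternatives that avoid writing a helper: the `.str` accessor for string methods, and `pd.to_datetime` followed by `.dt.strftime`. The two timestamps are taken from the log output above.

```python
import pandas as pd

date_time = pd.Series(['2016-02-08T21:44:41', '2015-09-04T11:32:18'])

def date_only(dt):
    """Keep the part of the timestamp before the 'T'."""
    return dt.split('T')[0]

via_apply = date_time.apply(date_only)
via_str = date_time.str.split('T').str[0]
via_datetime = pd.to_datetime(date_time).dt.strftime('%Y-%m-%d')

print(via_apply.tolist())
```

The `pd.to_datetime` route has the side benefit of validating that each value really is a timestamp.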

In [104]:
tf = logs.fm_ip == logs.to_ip

In [110]:
tf.head(12)

0     False
1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
11     True
dtype: bool

In [106]:
tf.unique()

array([False,  True])

In [107]:
tf.value_counts()

False    967
True      33
dtype: int64

In [108]:
logs[['fm_ip', 'to_ip']].head(12)

             fm_ip            to_ip
0  155.130.121.215   75.122.133.241
1  106.152.115.161  106.152.114.248
2    60.15.193.250  155.130.121.215
3   190.214.22.201   190.214.22.116
4     60.15.193.74     60.15.193.95
5  106.152.114.248    102.86.56.203
6     60.15.193.95  155.130.121.215
7    220.211.18.48   106.152.115.49
8  106.152.114.248   190.214.22.201
9     60.15.193.74  106.152.114.248
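A boolean `Series` such as `tf` can be used directly as a row filter, and summing it counts the `True` values. A minimal sketch on a three-row stand-in frame (`pairs` is a name made up for this example):

```python
import pandas as pd

pairs = pd.DataFrame({
    'fm_ip': ['60.15.193.74', '102.86.56.213', '220.211.18.48'],
    'to_ip': ['60.15.193.95', '102.86.56.213', '106.152.115.49'],
})

tf = pairs['fm_ip'] == pairs['to_ip']   # True where sender and receiver match
same = pairs[tf]                        # keep only the matching rows

print(tf.sum())    # 1
print(same)
```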


# Experience Points!
---

In your **text editor** create a simple script called:

```bash
my_DataFrame_01.py
```

Follow these steps:

1. Create a pandas `DataFrame` called `na_log` by reading in the CSV file `log_file_na.csv` and using the names:
    
    `['name', 'email', 'fm_ip', 'to_ip', 'date_time', 'lat', 'long', 'payload_size']`
1. Select the `payload_size` column and assign it to a separate pandas `Series`
1. Run both the `min()` method and the `max()` method on your `Series`
1. Calculate the difference between the two values you get back

Execute your script from the **terminal/command line** using the command:

```bash
ipython -i my_DataFrame_01.py
```

When you complete this exercise, please put your green post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

In [None]:
run soln_dataframe_01.py

# Experience Points!
---

In your **text editor** create a simple script called:

```bash
my_DataFrame_02.py
```

Then do the following:

1. Create a pandas `DataFrame` called `na_log` by reading in the CSV file `log_file_na.csv` and using the names:

    `['name', 'email', 'fm_ip', 'to_ip', 'date_time', 'lat', 'long', 'payload_size']`
1. Print to the screen the content of the two columns: `long` and `lat`
1. Create a new Series that is made up of the **difference** between the value of `long` and the value of `lat`
1. Apply the `round()` function to the Series so that each value is rounded to the nearest full integer
1. Use the `.unique()` method to print out all of the unique values

Execute your script from the **terminal/command line** using the command:

```bash
ipython -i my_DataFrame_02.py
```

When you complete this exercise, please put your green post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

In [None]:
(logs.lat - logs.long).apply(round).value_counts()

# BACKUP...