# Welcome to the Dark Art of Coding:
## Introduction to Python
pandas: Series & DataFrames

<img src='../universal_images/dark_art_logo.600px.png' width='300' style="float:right">

# Objectives
---

In this session, students should expect to:

* understand the purpose and application of `pandas` to data analysis problems
* understand how to create and use a `Series`
* understand how to create and use a `DataFrame`
* explore various simple examples of pandas usage


# `pandas` basics
---

**`pandas`** is one of the premier data analysis libraries in the Python ecosystem. It offers high-performance, easy-to-use data structures and data analysis tools enabling you to carry out your entire data analysis workflow.

`pandas` is used for:

* data analysis/science
* financial analysis
* data manipulation
* data cleansing
* data transformation

`pandas` has tools to read and write data to and from multiple data formats.

It also includes tools that simplify:

* grouping data
* applying transformations to columns, rows and individual cells
* working with dates and times
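
As a rough sketch of a few of the tools listed above, the example below reads CSV text, converts a column to dates, transforms a column, groups rows, and writes the result back out. The sample data and the `emails_per_hour` column are invented for illustration; `io.StringIO` stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical sample data, invented for illustration only.
csv_text = """hero,date,emails
billy,2024-01-10,111
billy,2024-01-11,121
selina,2024-01-10,211
"""

# Read: parse CSV text into a DataFrame (a file path works the same way).
df = pd.read_csv(io.StringIO(csv_text))

# Dates and times: convert the text column to real datetime values.
df['date'] = pd.to_datetime(df['date'])

# Transformation: apply arithmetic to a whole column at once.
df['emails_per_hour'] = df['emails'] / 24

# Grouping: total emails per hero.
totals = df.groupby('hero')['emails'].sum()
print(totals)

# Write: serialize the DataFrame back to CSV text.
csv_out = df.to_csv(index=False)
```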

# List vs. Dict vs. Series vs. DataFrame
---

To better understand the `pandas` datatypes, it helps to review datatypes that you are already familiar with: `list` and `dict`. In this review, we will compare and contrast techniques to access or view data in `list`s and `dict`s.

## list 
```python                               
mylist = ['A', 'B', 'C']            
```

**indexable**: 

`mylist[0]` indexable by integer            

**sliceable**: 

`mylist[0:2]` sliceable by integer
 

## dict
```
mydict = {'alpha': 1,
          'beta': 2,
          'gamma': 2}
```

**indexable**: 

`mydict['alpha']` indexable by key


## Series
```
myseries = Series(['bruce', 'selina', 'kara', 'clark'], index=[0, 1, 2, 'three'])

          column
rows
0         'bruce'
1         'selina'
2         'kara'
'three'   'clark'
```

**indexable**: 

`myseries[0]` indexable by integer

`myseries['three']` indexable by row name

**sliceable**:

`myseries[0:3]` sliceable by integer                 

## DataFrame:
```
mydataframe = DataFrame(lots of data... we'll show you how to make a dataframe shortly)

        col1      col2        col3      age
rows
0       'bruce'   'wayne'     'M'       42
1       'selina'  'kyle'      'F'       34
'two'   'kara'    'zor-el'    'F'       27
3       'clark'   'kent'      'M'       35
```

**indexable**

`mydataframe['col1']` indexable by column name (can also be indexed by row)

`mydataframe[['col1', 'age']]` indexable by multiple column/row indicators
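
For readers who want to try the table above right away, here is one possible way to build it: a minimal sketch using a dict of columns, which is only one of several construction options. The variable names `first_names`, `subset` and `kara` are illustrative.

```python
import pandas as pd

# The same table as above, built from a dict of columns.
mydataframe = pd.DataFrame(
    {'col1': ['bruce', 'selina', 'kara', 'clark'],
     'col2': ['wayne', 'kyle', 'zor-el', 'kent'],
     'col3': ['M', 'F', 'F', 'M'],
     'age': [42, 34, 27, 35]},
    index=[0, 1, 'two', 3])

first_names = mydataframe['col1']       # one column name -> Series
subset = mydataframe[['col1', 'age']]   # list of column names -> DataFrame
kara = mydataframe.loc['two']           # one row, selected by its label
```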

# Series
---

In [None]:
# Let's start by making a simple Series.
# It is customary to import pandas by the alias: pd

import pandas as pd
from pandas import Series

s = Series([33, 37, 27, 42])

# pandas will assign an index automatically starting at "0"

s

In [None]:
# We can see that the object is a Series object
type(s)

In [None]:
s[0]

So, what's the difference between a `Series` and a `list`?

In [None]:
l = [33, 37, 27, 42]

In [None]:
# Let's use tab completion to explore the difference
# l.<tab complete>

l.

# There are 11 methods/attributes associated with lists...

In [None]:
# s.<tab complete>

s.

# There are 226 methods/attributes associated with Series
#     (the exact count varies by pandas version)...
# len(list(attr for attr in dir(s) if not attr.startswith('_')))

In [None]:
# Series objects can be assigned a name 
# The index (0, 1, 2...) can also be assigned directly/overwritten.

s.name = 'Justice League ages'
s.index = ['bruce', 'selina', 'kara', 'clark']

s

In [None]:
s['kara']

In [None]:
# NOTE: The Series factory function allows you to assign attributes
#     such as the index directly.

s1 = Series([37, 36, 10, 36],
            index=['hal', 'victor', 'diana', 'billy'],
            name='More Justice League ages')
s1

# Generally, any ordered iterable can be used to produce the inputs
#     to a Series. We used a list here.
#     range(), generators and arrays are also acceptable inputs.

In [None]:
# Accessing a row directly uses brackets and the 
#     name of the row.

s1['billy']

In [None]:
# Accessing multiple rows uses the names of 
#     the rows embedded in an iterable (i.e. a list)

s1[['billy', 'hal', 'victor']]

In [None]:
# Accessing multiple rows may also use slice notation.
#     slice notation can often be used with both integer indexes and
#     string indexes.
# Here we use string indexes... 

s1['hal':'diana']

In [None]:
# slice notation using integers still works even if we have 
#     applied a string-based index.
# Here we use integer indexes... but NOTE: the difference in behavior
# * string indexing includes the last element.
# * integer indexing behaves more like Python and goes up to but does NOT include
#   the last element

s1[0:2]
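
The difference between label slices and integer slices can be made explicit with the `.loc` (label-based) and `.iloc` (position-based) accessors. The sketch below rebuilds the same data under the illustrative name `ages`.

```python
import pandas as pd

ages = pd.Series([37, 36, 10, 36],
                 index=['hal', 'victor', 'diana', 'billy'])

by_label = ages.loc['hal':'diana']   # label slice: INCLUDES 'diana'
by_position = ages.iloc[0:2]         # position slice: stops BEFORE position 2

print(by_label)
print(by_position)
```

Using `.loc` and `.iloc` states the intent directly, which can help when an index itself contains integers.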

In [None]:
# We can also index the Series using a list of items:
# NOTE: the order of the result matches the order of our
#     selection

s1[['billy', 'hal', 'diana']]

In [None]:
# Similarly, assignment of a value to a specific row
#     uses bracket indexing and behaves similarly to 
#     a list OR dict

s1['diana'] = 32
s1

It is possible to check whether an element in a `Series` matches a specific value.


In [None]:
s1['billy'] == 35

More holistically, it is possible to see which elements in a `Series` match a specific value.

In [None]:
s1 == 32

In [None]:
result = s1 == 36

print(type(result), result, sep='\n\n')

In [None]:
# Rows can be filtered using any such sequence of True and False

s1[result]

In [None]:
# We can also use any of the standard comparison operators
#     ==
#     <= 
#     >=
#     <
#     >
#
# In addition, there is no need to save the True/False series... you can do the evaluation
#     inside the indexing brackets

s1[s1 >= 33]
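
Comparisons can also be combined. The sketch below, using the illustrative name `ages`, shows the element-wise operators `&` (and), `|` (or) and `~` (not); each comparison needs its own parentheses because these operators bind more tightly than `>=` or `<`.

```python
import pandas as pd

ages = pd.Series([37, 36, 10, 36],
                 index=['hal', 'victor', 'diana', 'billy'])

between = ages[(ages >= 30) & (ages < 37)]   # both conditions must hold
extremes = ages[(ages < 20) | (ages > 36)]   # either condition may hold
not_36 = ages[~(ages == 36)]                 # invert a True/False Series
```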

# Experience Points!
---

In your **text editor** create a simple script called:

```bash
my_series_01.py
```

Create a pandas Series called `restaurant_ratings` according to the following guidelines:

* starting at the top, each row should contain one number from 1 to 5 (inclusive)
* give the series a `.name` called ratings
* give the series a `.index` with the names of five restaurants

Execute your script in **Jupyter** using the command:

```bash
run my_series_01.py
```


From Jupyter, explore your `Series` object (`restaurant_ratings`) by performing our typical explorations:
* `type()`
* `.<tab complete>`

Also look at the attributes you have added: 

* `.name`
* `.index`

Lastly extract particular records from your `Series`:

* Choose a record from the `Series` using the name of one of your restaurants
* Choose three records from the `Series` using a list of names of restaurants



When you complete this exercise, please put your green post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

In [None]:
run soln_series_01.py

# Experience Points!
---

In your **text editor** create a simple script called:

```bash
my_series_02.py
```

Create a pandas `Series` called `bacteria_lengths` according to the following guidelines:

* starting at the top, each row should contain one number from 1 to 5000 (inclusive) incrementing by 100
* give the `Series` a `.name` called length
* NOTE: do not worry about a `.index` for this `Series`. Simply use the default indexing that `Series` provide.

Execute your script from the **terminal/command line** using the command:

```bash
ipython -i my_series_02.py
```


From the IPython Interpreter, explore your Series object by performing our typical explorations:
* `type()`
* `.<tab complete>`

Also look at the attributes you have added or that were created by default: 

* `.name`
* `.index`

Lastly extract particular records from your Series:

* Choose a record from the Series at index 23
* Choose three records from the Series at index 23-25

When you complete this exercise, please put your green post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

In [None]:
run soln_series_02.py

In [None]:
# bacteria_lengths[23]

In [None]:
# bacteria_lengths[23:26]

In [None]:
# Much like numpy, pandas Series (and DataFrames)
#     offer vector mathematics whereby you can add to
#     or multiply against all rows or cells
#     WITHOUT using a for loop.

s1 * 2

In [None]:
s1

In [None]:
# Knowing that the selection of specific elements from a Series simply returns
#     a Series, we can also perform vector multiplication against a 
#     selection

s1[['diana', 'billy']] * 20

In [None]:
s1

In [None]:
# Testing for inclusion in a Series checks the index labels,
#     as with the keys of a dict (NOT the values, as with a list)

'diana' in s1

In [None]:
# A test for inclusion in a list, for comparison

42 in [1, 2, 3, 4]

In [None]:
'lex' in s1
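
Note that `in` applied to a `Series` looks at the index labels, not the stored values. The sketch below (with the illustrative name `ages`) shows two ways to test the values instead: `in` against `.values`, and the element-wise `.isin()` method.

```python
import pandas as pd

ages = pd.Series([37, 36, 32, 36],
                 index=['hal', 'victor', 'diana', 'billy'])

label_found = 'diana' in ages       # True: 'diana' is an index label
value_as_label = 36 in ages         # False: 36 is a value, not a label
value_found = 36 in ages.values     # True: search the underlying values
matches = ages.isin([36, 37])       # one True/False per row
```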

In [None]:
# Series can be built in many ways, including using a dictionary
#     to populate the row labels and the row contents

names = {'bruce wayne': 'bwayne@jleague.org',
         'hal jordan': 'hjordan@jleague.org',
         'clark kent': 'ckent@jleague.org',
         'barry allen': 'ballen@jleague.org',
         'diana prince': 'dprince@jleague.org',
         'arthur curry': 'acurry@jleague.org',
         'billy batson': 'bbatson@jleague.org',
         'john jones': 'jjones@jleague.org',
         'victor stone': 'vstone@jleague.org',
         'dick grayson': 'dgrayson@jleague.org',
         'ray palmer': 'rpalmer@jleague.org',
         'dinah lance': 'dlance@jleague.org',
         'kara zor-el': 'kzor-el@jleague.org',
         'john constantine': 'jconstantine@jleague.org',
         'barbara gordon': 'bgordon@jleague.org',
         'kyle rayner': 'krayner@jleague.org',
         'selina kyle': 'skyle@jleague.org',
         'wally west': 'wwest@jleague.org',
         }

emails = Series(names)
# emails.index
# emails.values

In [None]:
emails

In [None]:
emails[    ['barry allen',  'selina kyle']     ]

# Analyzing data
---

In [None]:
s1 = Series(range(10, 16), index=['a', 'b', 'c', 'd', 'e', 'f'])
s2 = Series(range(16, 22), index=['a', 'b', 'c', 'x', 'y', 'z'])

# s1
# s2

In [None]:
s2

In [None]:
s1

In [None]:
s3 = s1 + s2
s3

# type(s3)
# pd.isnull(s3)
# s3.isnull()
# s3.<tab>
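
Addition lines rows up by index label, so a label present in only one `Series` yields `NaN`. The sketch below repeats the same data under the illustrative names `left` and `right`, and shows `.add()` with `fill_value=0` as one way to treat the missing side as zero.

```python
import pandas as pd

left = pd.Series(range(10, 16), index=['a', 'b', 'c', 'd', 'e', 'f'])
right = pd.Series(range(16, 22), index=['a', 'b', 'c', 'x', 'y', 'z'])

aligned = left + right                   # unmatched labels become NaN
filled = left.add(right, fill_value=0)   # a missing side counts as 0

print(aligned)
print(filled)
```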

In [None]:
pd.concat([s1, s2], ignore_index=True)

In [None]:
# How do I learn more?
# s3.<method_name>?        # just ask by typing the method name (sans parentheses) and 
#                          # adding a question mark to see the builtin help docs
# 
# s3.value_counts?
# s3.value_counts(dropna=False)

In [None]:
s3.value_counts?

In [None]:
s3.value_counts(dropna=False, ascending=True)

In [None]:
s3.dropna()
# s3

In [None]:
s4 = Series([42, 1, 1, 1, 2, 2, 3, 3, 3, 3, 3, 3, 4, 5, 6])

# s4.unique()
# s4.value_counts()
# s4.max()
# s4 + 2

In [None]:
s4.value_counts()

In [None]:
def transmogrifier(x):
    '''hat tip to Calvin and Hobbes for introducing me to this 
    truly fantastic word. thanks, bill watterson.

    "transform, especially in a surprising or magical manner."
    '''
    new_val = '- ' + str(x ** 3) + ' -'
    return new_val

In [None]:
s4.apply(transmogrifier)
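
`.apply()` calls the given function once per element. For purely numeric work, the same result can often be had with vector arithmetic, as in this small sketch (a shortened version of the data, with illustrative variable names):

```python
import pandas as pd

s4 = pd.Series([42, 1, 1, 2, 3])

cubed_apply = s4.apply(lambda x: x ** 3)   # function called per element
cubed_vector = s4 ** 3                     # whole Series at once

print(cubed_vector)
```

`.apply()` remains the natural choice when the transformation is not plain arithmetic, such as the string formatting in `transmogrifier`.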

# DataFrames
---

In [None]:
from pandas import DataFrame

In [None]:
# Making a DataFrame # 1
# Using a dictionary:

data = {'hero': ['billy', 'billy', 'billy', 'selina', 'selina'],
        'date': ['Jan 10', 'Jan 11', 'Jan 12', 'Jan 10', 'Jan 11'],
        'emails': [111, 121, 93, 211, 210]}

df = DataFrame(data)
df

In [None]:
df = DataFrame(data, columns=['date', 'hero', 'emails'])
df

In [None]:
df = DataFrame(data, columns=['date', 'hero', 'emails', 'instagrams'],
               index=[1, 2, 3, 4, 5])
df

# df
# df.columns

In [None]:
df['instagrams'] = 42

In [None]:
df

In [None]:
df[['date', 'emails']]

In [None]:
# df['hero']
# df.hero

df.loc[1]

In [None]:
df.loc[1:2]

In [None]:
df.loc[1:4:2]

In [None]:
from pandas import Series

df.instagrams = 50

In [None]:
df.emails

In [None]:
ins = Series([10, 20, 30], index=[0, 2, 4])
ins

In [None]:
df['instagrams'] = ins
df
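
Assigning a `Series` to a column matches rows by index label rather than by position. The following sketch uses a reduced version of the DataFrame to show the effect: labels missing from the `Series` become `NaN`, and labels missing from the DataFrame are dropped.

```python
import pandas as pd

df = pd.DataFrame({'emails': [111, 121, 93, 211, 210]},
                  index=[1, 2, 3, 4, 5])
ins = pd.Series([10, 20, 30], index=[0, 2, 4])

df['instagrams'] = ins   # aligned on index labels

# label 0 has no matching row, so the value 10 is discarded
# labels 1, 3 and 5 have no value in ins, so they become NaN
print(df)
```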

In [None]:
# If you want to add a new column, dataframes are completely
#     mutable: columns can be added at will.

df['overworked'] = df['emails'] >= 120
df

In [None]:
df[    df['overworked'] == False       ]

In [None]:
# The same comparison can also be kept as a standalone
#     True/False Series instead of being added as a column.

standalone_series = df['emails'] >= 120
standalone_series

In [None]:
mask = df['instagrams'] != 20.0
mask

In [None]:
df[mask]

In [None]:
df[df['date'] != 'Jan 10']

In [None]:
# Making a DataFrame # 2
# Using a dictionary with nested dictionaries...

data = {'billy': {'Jan 10': 202, 'Jan 11': 220, 'Jan 12': 198},
        'selina': {'Jan 09': 246, 'Jan 10': 235, 'Jan 11': 243}}

In [None]:
df2 = DataFrame(data)
df2
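
With nested dictionaries, the outer keys become columns and the inner keys become row labels; a date that appears for only one hero leaves a `NaN` for the other. A minimal sketch of that behavior, repeating the same data:

```python
import pandas as pd

data = {'billy': {'Jan 10': 202, 'Jan 11': 220, 'Jan 12': 198},
        'selina': {'Jan 09': 246, 'Jan 10': 235, 'Jan 11': 243}}

df2 = pd.DataFrame(data)

# outer keys -> columns, inner keys -> row labels
print(df2.columns.tolist())
print(df2.isnull())
```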

In [None]:
# df2.T
dft = df2.T
dft

In [None]:
dft.columns.name = 'date'
dft.index.name = 'hero'

In [None]:
dft

In [None]:
# using indexes
nums = Series(range(10, 16),
              index=['t', 'u', 'v', 'x', 'y', 'z'])
nums

In [None]:
i = nums.index
print(type(i), i)

In [None]:
i[::3]

# i[2:4]
# i[::2]
# i[::3]
# i[4]

In [None]:
logs = pd.read_csv('log_file_1000.csv', names=['name',
                                               'email',
                                               'fm_ip',
                                               'to_ip',
                                               'date_time',
                                               'lat',
                                               'long',
                                               'payload_size'])

In [None]:
logs

In [None]:
pd.read_csv?

In [None]:
logs['fm_ip'].unique()

In [None]:
logs['name'].value_counts()

In [None]:
logs['name'].tail(13)

In [None]:
g = logs.groupby(logs['fm_ip'])

type(g)


In [None]:
g.ngroups

In [None]:
g.first()

In [None]:
for item in g:
    print(item)
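
Each item produced by iterating over a groupby object is a `(key, DataFrame)` pair, so it can be unpacked directly in the `for` statement. The sketch below uses a small, invented stand-in for the log data (the IP addresses and sizes are hypothetical).

```python
import pandas as pd

# Hypothetical stand-in for the log file, invented for illustration.
sample_logs = pd.DataFrame({'fm_ip': ['192.0.2.1', '192.0.2.2', '192.0.2.1'],
                            'payload_size': [100, 250, 300]})

sizes = {}
for ip, group in sample_logs.groupby('fm_ip'):
    # ip is the group key; group is a DataFrame of the matching rows
    sizes[ip] = group['payload_size'].sum()

print(sizes)
```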

In [None]:
type(g.get_group('106.152.115.161'))

In [None]:
g.get_group('106.152.115.161').head(10)

In [None]:
def date_only(dt):
    day = dt.split('T')[0]
    return day

In [None]:
logs['date'] = logs['date_time'].apply(date_only)
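
There are other ways to pull the date out of a timestamp string. The sketch below compares three, assuming ISO-style timestamps such as `2016-03-01T09:15:00`; the sample values are invented, and the real file's format may differ.

```python
import pandas as pd

# Hypothetical timestamps, assumed to be in ISO format.
date_time = pd.Series(['2016-03-01T09:15:00', '2016-03-02T17:40:12'])

via_apply = date_time.apply(lambda dt: dt.split('T')[0])           # per-element function
via_str = date_time.str.split('T').str[0]                          # vectorized string methods
via_datetime = pd.to_datetime(date_time).dt.strftime('%Y-%m-%d')   # parse, then format

print(via_datetime)
```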

In [None]:
logs.columns

In [None]:
logs.date

In [None]:
tf = logs.fm_ip == logs.to_ip

In [None]:
tf

In [None]:
tf.unique()

In [None]:
tf.value_counts()

In [None]:
logs[['fm_ip', 'to_ip']].head(12)

# Experience Points!
---

In your **text editor** create a simple script called:

```bash
my_DataFrame_01.py
```

Follow these steps:

1. Create a pandas `DataFrame` called `na_log` by reading in the csv `log_file_na.csv` and using the names:
    
    ```['name', 'email', 'fm_ip', 'to_ip', 'date_time', 'lat', 'long', 'payload_size']```
1. Select the `payload_size` column and assign it to its own variable as a separate pandas `Series`
1. Run both the `min()` method and the `max()` method on your `Series`
1. Calculate the difference between the two values you get back

Execute your script from the **terminal/command line** using the command:

```bash
ipython -i my_DataFrame_01.py
```

When you complete this exercise, please put your green post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

In [None]:
run soln_dataframe_01.py

# Experience Points!
---

In your **text editor** create a simple script called:

```bash
my_DataFrame_02.py
```

Then do the following:

1. Create a pandas `DataFrame` called `na_log` by reading in the csv `log_file_na.csv` and using the names:
    1. `['name', 'email', 'fm_ip', 'to_ip', 'date_time', 'lat', 'long', 'payload_size']`
1. Print to the screen the content of the two columns: `long` and `lat`
1. Create a new Series that is made up of the **difference** between the value of `long` and the value of `lat`
1. Apply the `round()` function to the Series so that each value is rounded to the nearest full integer
1. Use the `.unique()` method to print out all of the unique values

Execute your script from the **terminal/command line** using the command:

```bash
ipython -i my_DataFrame_02.py
```

When you complete this exercise, please put your green post-it on your monitor. 

If you want to continue on at your own pace, please feel free to do so.

<img src='../universal_images/green_sticky.300px.png' width='200' style='float:left'>

In [None]:
(logs.lat - logs.long).apply(round).value_counts()
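
One caveat when rounding: Python's builtin `round()` raises a `ValueError` on `NaN`, whereas the `Series.round()` method leaves `NaN` in place. The sketch below uses invented numbers; whether the exercise file actually contains missing values in these columns is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical differences, including one missing value.
diff = pd.Series([1.4, 2.6, np.nan])

rounded = diff.round()   # NaN stays NaN; other values are rounded

try:
    round(float('nan'))  # the builtin cannot convert NaN to an integer
    builtin_ok = True
except ValueError:
    builtin_ok = False

print(rounded)
print(builtin_ok)
```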

# BACKUP...