## Pandas Practice

In [1]:
import numpy as np
import pandas as pd
from pydataset import data

#### pandas Series
- a Series is essentially a single column with an index
- if data is of mixed types, pandas will force it into a single type
- can generate a series from:
    1. a list
    2. a numpy array
    3. a dictionary
    4. a pandas DataFrame

#### Creating a Series from a list

In [None]:
my_list = [2, 3, 5]
type(my_list)

In [4]:
# a python list has an index, but it is always a fixed run of integer positions (0, 1, 2, ...)
my_list[0]

2

In [5]:
# a Series index can be a name, datetime, etc
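
To make the comment above concrete, here is a minimal sketch of a Series with a datetime index and another with string labels. The readings and names (`temps`, `scores`) are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical data: three daily temperature readings
dates = pd.to_datetime(['2021-01-01', '2021-01-02', '2021-01-03'])
temps = pd.Series([20.5, 22.1, 19.8], index=dates)

# String labels work as an index too
scores = pd.Series([90, 85], index=['alice', 'bob'])

print(temps.loc[pd.Timestamp('2021-01-02')])  # look up by date label
print(scores['bob'])                          # look up by string label
```

Unlike list positions, these labels stay attached to their values when the Series is filtered or sorted.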

In [6]:
# convert my_list to a Series
my_series = pd.Series(my_list)
type(my_series)

pandas.core.series.Series

In [7]:
# what's inside a pandas Series?
my_series

0    2
1    3
2    5
dtype: int64

- Index: the left column is the Series index
- Values: the right column is the Series data or values
- Data type: the dtype here is a 64-bit integer

#### Creating a Series from a numpy array

In [None]:
my_array = np.array([8.0, 13.0, 21.0])
print(type(my_array), my_array)

In [10]:
# convert to a Series
my_series = pd.Series(my_array)
type(my_series)

pandas.core.series.Series

In [11]:
my_series

0     8.0
1    13.0
2    21.0
dtype: float64

#### Creating a Series from a Dictionary

In [21]:
labeled_series = pd.Series({'a': 0, 'b': 1.5, 'c': 2, 'd': 3.5, 'e': 4, 'f': 5.5})

In [22]:
labeled_series

a    0.0
b    1.5
c    2.0
d    3.5
e    4.0
f    5.5
dtype: float64
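
The output above shows two things worth spelling out: the dictionary keys become the index, and the integer values were upcast to float because some values were floats. A small sketch with made-up data (`inventory` and `mixed` are illustrative names):

```python
import pandas as pd

# Hypothetical inventory counts keyed by item name
inventory = pd.Series({'apples': 3, 'pears': 0, 'plums': 7})

print(inventory.index.tolist())   # dictionary keys become the index
print(inventory['plums'])         # look up a value by its label
print(inventory.dtype)            # all ints, so no upcast to float

# A single float among the values upcasts the whole Series
mixed = pd.Series({'a': 0, 'b': 1.5})
print(mixed.dtype)
```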

#### Creating a Series from a DataFrame
- the `data` function from pydataset loads various sample datasets; here, the sleepstudy data

In [18]:
sleep_df = data('sleepstudy')
sleep_df.head()

|   | Reaction | Days | Subject |
|:--|:---------|:-----|:--------|
| 1 | 249.56 | 0 | 308 |
| 2 | 258.7047 | 1 | 308 |
| 3 | 250.8006 | 2 | 308 |
| 4 | 321.4398 | 3 | 308 |
| 5 | 356.8519 | 4 | 308 |


In [16]:
# two syntax options to select a column from this DataFrame:
# 1) df.columnname  2) df['columnname']
# the DataFrame's index is retained for the Series
reaction_series = sleep_df.Reaction
type(reaction_series)

pandas.core.series.Series

In [17]:
days_series = sleep_df['Days']
type(days_series)

pandas.core.series.Series

In [23]:
# note that I could've created a single column DataFrame instead by using [[]]
df_that_resembles_a_series = sleep_df[['Days']]
type(df_that_resembles_a_series)

pandas.core.frame.DataFrame

In [25]:
# note that the printed output is formatted differently than a Series
df_that_resembles_a_series

|   | Days |
|:--|:-----|
| 1 | 0 |
| 2 | 1 |
| 3 | 2 |
| 4 | 3 |
| 5 | 4 |
| ... | ... |
| 176 | 5 |
| 177 | 6 |
| 178 | 7 |
| 179 | 8 |


In [26]:
days_series

1      0
2      1
3      2
4      3
5      4
      ..
176    5
177    6
178    7
179    8
180    9
Name: Days, Length: 180, dtype: int64
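
The single-bracket versus double-bracket difference can be reproduced without pydataset. The sketch below uses a small hand-built frame as a stand-in; its values are invented for illustration.

```python
import pandas as pd

# Small stand-in frame; the values are made up for illustration
df = pd.DataFrame({'Days': [0, 1, 2], 'Reaction': [250.0, 259.0, 251.0]},
                  index=[1, 2, 3])

as_series = df['Days']     # single brackets -> Series
as_frame = df[['Days']]    # double brackets -> one-column DataFrame

print(type(as_series).__name__, as_series.shape)  # one dimension
print(type(as_frame).__name__, as_frame.shape)    # rows and columns
print(as_series.index.equals(df.index))           # the index carries over
```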

#### Data types for Series or DataFrames:
- `int` integers
- `float` floating point decimal numbers
- `bool` booleans -- True or False values
- `object` the dtype used for Python strings and for mixed-type data
- `category` a fixed set of string values

A Series can also carry a name: an optional, human-friendly label for the Series.

A dtype is set in one of two ways:
- inferring: pandas picks the dtype from the data
- using `astype()`: we cast to a dtype explicitly
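
As a rough sketch of two items that are not demonstrated elsewhere in these notes, the `category` dtype and the Series name, consider the following. The shirt-size data is invented for illustration.

```python
import pandas as pd

sizes = pd.Series(['small', 'large', 'small', 'medium'], name='shirt_size')
print(sizes.name)    # the optional human-friendly name
print(sizes.dtype)   # the dtype pandas inferred for strings

as_cat = sizes.astype('category')
print(as_cat.dtype)                    # category
print(as_cat.cat.categories.tolist())  # the fixed set of allowed values
```

A categorical Series stores each distinct string once, which can save memory when a column repeats a few values many times.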

#### Inferring
- pandas will infer the datatype when it can, e.g.

In [28]:
# here it recognizes that I fed it a list of Booleans
pd.Series([True, False, True])

0     True
1    False
2     True
dtype: bool

In [29]:
# here it stores a list of strings as object type
pd.Series(['I', 'love', 'lamp'])

0       I
1    love
2    lamp
dtype: object

In [30]:
# here it sees a mixed type list and chooses object type since a series
# must have a single datatype
my_series = pd.Series([1, 3, 'five'])
my_series

0       1
1       3
2    five
dtype: object

In [31]:
# filter out 'five' from the Series and reassign
my_new_series = my_series[my_series != 'five']
my_new_series

0    1
1    3
dtype: object

In [33]:
# now that we have only integers in the Series, we can cast to an 
# int datatype using the .astype() method:
my_new_series.astype('int')

0    1
1    3
dtype: int64

In [35]:
# trying this on the original Series raises a ValueError,
# because 'five' cannot be converted to an int
# my_series.astype('int')
# uncomment to see the error message
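
For reference, the failure can be caught rather than left commented out. The sketch below also shows `pd.to_numeric` with `errors='coerce'`, which is an alternative to filtering by hand and is not the approach used in the cells above.

```python
import pandas as pd

mixed = pd.Series([1, 3, 'five'])

try:
    mixed.astype('int')
    cast_failed = False
except ValueError as err:
    cast_failed = True
    print('astype failed:', err)

# Alternative: turn unparseable entries into NaN instead of raising
numeric = pd.to_numeric(mixed, errors='coerce')
print(numeric)                          # float64, with NaN where 'five' was

cleaned = numeric.dropna().astype('int')  # drop the NaN, then cast
print(cleaned)
```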

In [37]:
# The Subject column in the sleep study holds ID labels, not quantities,
# so it makes sense to cast it to strings (the argument is 'str';
# the resulting dtype is reported as object)
sleep_subj_series = sleep_df['Subject'].astype('str')
sleep_subj_series

1      308
2      308
3      308
4      308
5      308
      ... 
176    372
177    372
178    372
179    372
180    372
Name: Subject, Length: 180, dtype: object

#### Vectorized Series Operations
- a pandas Series is built on a numpy array with an index added, so arithmetic and comparison operators work element-wise, just as they do on an array.

In [38]:
fibi_series = pd.Series([0, 1, 1, 2, 3, 5, 8])

fibi_series.head()

0    0
1    1
2    1
3    2
4    3
dtype: int64

In [39]:
fibi_series + 1

0    1
1    2
2    2
3    3
4    4
5    6
6    9
dtype: int64

In [40]:
fibi_series / 2

0    0.0
1    0.5
2    0.5
3    1.0
4    1.5
5    2.5
6    4.0
dtype: float64

In [41]:
# note that the operations above returned new Series; fibi_series itself didn't change
fibi_series

0    0
1    1
2    1
3    2
4    3
5    5
6    8
dtype: int64

In [42]:
fibi_series >= 5

0    False
1    False
2    False
3    False
4    False
5     True
6     True
dtype: bool

In [43]:
(fibi_series >= 3) & (fibi_series % 2 == 0)

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
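
A boolean Series like the one above is most often used as a mask to filter rows. A minimal sketch follows; `fib` is a fresh copy of the same values so the block runs on its own.

```python
import pandas as pd

fib = pd.Series([0, 1, 1, 2, 3, 5, 8])

# Parentheses are required around each comparison when combining with & or |
mask = (fib >= 3) & (fib % 2 == 0)
print(fib[mask])                     # keeps only the rows where mask is True

# | is element-wise OR, ~ is element-wise NOT
print(fib[(fib < 1) | (fib > 5)])
print(fib[~mask])
```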

#### Series 'Attributes'
- attributes are a sort of metadata about pandas objects
- they are accessed with dot notation and no parentheses (unlike `.methods()`)
- in Jupyter, you can see a list of attributes by pressing `tab` after `seriesname.`
- `.index` allows us to reference the Series index
- `.values` allows us to reference the values of the Series
- `.dtype` allows us to reference the datatype of the Series
- `.size` returns an int of the number of rows. Null values ARE included.
- `.shape` returns a tuple of dimension lengths: `(rows,)` for a Series, `(rows, columns)` for a DataFrame. Null values ARE included.

In [50]:
fibi_series.dtype

dtype('int64')

In [47]:
fibi_series.index

RangeIndex(start=0, stop=7, step=1)

In [49]:
# note that the values are literally a numpy array
fibi_series.values

array([0, 1, 1, 2, 3, 5, 8])
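
The point about nulls is easiest to see next to the `.count()` method, which skips them. A small sketch with one missing value (`with_gap` is an illustrative name):

```python
import numpy as np
import pandas as pd

with_gap = pd.Series([1.0, np.nan, 3.0])

print(with_gap.size)     # 3: counts every row, missing or not
print(with_gap.shape)    # (3,): a one-element tuple for a Series
print(with_gap.count())  # 2: the method skips missing values
```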

#### Series Methods
- methods used on pandas Series often return new Series objects
- most methods do not overwrite the original values by default; methods that accept an `inplace` argument default to `inplace=False`
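
A quick sketch of what "returns a new Series" means in practice, using `drop_duplicates()` as an arbitrary example method:

```python
import pandas as pd

original = pd.Series([3, 1, 3])

deduped = original.drop_duplicates()   # returns a new Series
print(original.tolist())               # still [3, 1, 3]
print(deduped.tolist())                # [3, 1]

# To keep the result under the same name, reassign it
original = original.drop_duplicates()
print(original.tolist())               # [3, 1]
```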

- `.head()` returns the first 5 rows of the Series
- `.tail()` returns the last 5 rows
- 5 is the default; adjust it by passing an int, e.g. `.head(10)`

In [52]:
fibi_series.head(), fibi_series.tail(2)

(0    0
 1    1
 2    1
 3    2
 4    3
 dtype: int64,
 5    5
 6    8
 dtype: int64)

- `.sample()` returns a random sample of rows in the Series
- n=1 by default; can pass any int up to the number of rows
- n can exceed the number of rows with the argument `replace=True`

In [56]:
fibi_series.sample(), fibi_series.sample(3), fibi_series.sample(20, replace=True)

(2    1
 dtype: int64,
 4    3
 1    1
 5    5
 dtype: int64,
 5    5
 5    5
 6    8
 6    8
 2    1
 2    1
 0    0
 5    5
 3    2
 2    1
 5    5
 6    8
 6    8
 2    1
 2    1
 0    0
 0    0
 4    3
 4    3
 1    1
 dtype: int64)
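
Because `.sample()` is random, the output above will differ on every run. One way to make a sample repeatable is the `random_state` argument, sketched here along with `frac`, which draws a proportion of rows rather than a fixed count. The seed value 42 is arbitrary.

```python
import pandas as pd

fib = pd.Series([0, 1, 1, 2, 3, 5, 8])

first = fib.sample(3, random_state=42)
second = fib.sample(3, random_state=42)
print(first.equals(second))   # same seed, same rows

# frac draws a proportion of the rows instead of a fixed count
half = fib.sample(frac=0.5, random_state=42)
print(len(half))
```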

- `.value_counts()` counts number of records/items/rows containing each unique value (think 'group by' from MySQL)

In [57]:
sleep_days_series = sleep_df.Days
sleep_days_series.value_counts()

0    18
1    18
2    18
3    18
4    18
5    18
6    18
7    18
8    18
9    18
Name: Days, dtype: int64

- In MySQL this would look like:

        SELECT Days, COUNT(Subject)
        FROM my_df
        GROUP BY Days;
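
To tie the SQL comparison back to pandas, here is a sketch comparing `.value_counts()` with an explicit `groupby`. The data is a made-up stand-in for the Days column.

```python
import pandas as pd

# Made-up stand-in for the Days column
days = pd.Series([0, 0, 1, 1, 1, 2], name='Days')

counts = days.value_counts()              # sorted by count, largest first
print(counts)

by_group = days.groupby(days).size()      # sorted by the group key instead
print(by_group)

shares = days.value_counts(normalize=True)  # proportions rather than counts
print(shares)
```

Both approaches give the same counts; they differ only in how the result is ordered.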

#### Descriptive Stats in pandas
- pandas has a number of methods that can be used to view summary stats about our data. Below are some of the most common:

| Function | Description |
|:------------|:--------------------------------|
| count | Number of non-NA observations |
| sum | Sum of values |
| mean | Mean of values |
| median | Arithmetic median of values |
| min | Minimum |
| max | Maximum |
| mode | Mode |
| abs | Absolute Value |
| std | Bessel-corrected sample standard deviation |
| quantile | Sample quantile (value at %) |

In [58]:
sleep_reaction_time = sleep_df.Reaction

In [59]:
{
    'count': sleep_reaction_time.count(),
    'sum': sleep_reaction_time.sum(),
    'mean': sleep_reaction_time.mean(),
    'median': sleep_reaction_time.median()
}

{'count': 180,
 'sum': 53731.42049999999,
 'mean': 298.50789166666664,
 'median': 288.6508}
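
A few of the methods not called in the cell above take arguments or return something other than a single number. A short sketch on the Fibonacci values used earlier (rebuilt here as `fib` so the block runs on its own):

```python
import pandas as pd

fib = pd.Series([0, 1, 1, 2, 3, 5, 8])

print(fib.std())                   # sample standard deviation (ddof=1)
print(fib.std(ddof=0))             # population standard deviation
print(fib.quantile(0.5))           # the 50% quantile is the median
print(fib.quantile([0.25, 0.75]))  # a list of quantiles returns a Series
print(fib.mode())                  # mode returns a Series, since ties are possible
```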

- `.describe()` returns a Series of descriptive stats for a Series (or a DataFrame of stats for a DataFrame)
- what it returns depends on the dtype of the values

In [60]:
sleep_reaction_time.describe()

count    180.000000
mean     298.507892
std       56.328757
min      194.332200
25%      255.375825
50%      288.650800
75%      336.752075
max      466.353500
Name: Reaction, dtype: float64

In [61]:
print(fibi_series)
fibi_series.describe()

0    0
1    1
2    1
3    2
4    3
5    5
6    8
dtype: int64


count    7.000000
mean     2.857143
std      2.794553
min      0.000000
25%      1.000000
50%      2.000000
75%      4.000000
max      8.000000
dtype: float64
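
Both examples above are numeric. As a sketch of how the result depends on dtype, calling `.describe()` on a Series of strings reports a different set of stats. The word list is illustrative.

```python
import pandas as pd

words = pd.Series(['I', 'love', 'lamp', 'lamp'])
summary = words.describe()
print(summary)   # count, unique, top (most common value), freq (its count)
```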

- `.nlargest()` returns the n largest values in a Series
- `.nsmallest()` returns the n smallest values in a Series
- both default to n=5; pass `n=an_integer` to change it
- for ties, the `keep` argument can be `'first'` (the default), `'last'`, or `'all'`

In [65]:
fibi_series.nlargest(n=3, keep='first')

6    8
5    5
4    3
dtype: int64

In [68]:
fibi_series.nsmallest(n=2, keep='all')

0    0
1    1
2    1
dtype: int64
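
The three `keep` options only differ when there is a tie at the cutoff. A side-by-side sketch on the same values, where the two 1s at index 1 and 2 are tied:

```python
import pandas as pd

fib = pd.Series([0, 1, 1, 2, 3, 5, 8])

first = fib.nsmallest(2, keep='first')  # keeps the earlier of the tied rows
last = fib.nsmallest(2, keep='last')    # keeps the later of the tied rows
every = fib.nsmallest(2, keep='all')    # keeps all tied rows, so may exceed n

print(first.index.tolist())
print(last.index.tolist())
print(every.index.tolist())
```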

- `.sort_values()` sorts values defaulting to ascending=True
- `.sort_index()` sorts index defaulting to ascending=True

In [69]:
sleep_reaction_time.sort_index(ascending=False)

180    364.1236
179    369.1417
178    343.2199
177    334.4818
176    329.6076
         ...   
5      356.8519
4      321.4398
3      250.8006
2      258.7047
1      249.5600
Name: Reaction, Length: 180, dtype: float64
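
For contrast with `.sort_index()`, here is a sketch of both sort methods on a small Series with string labels (the values are invented):

```python
import pandas as pd

s = pd.Series([30, 10, 20], index=['c', 'a', 'b'])

by_value = s.sort_values()                      # 10, 20, 30
by_value_desc = s.sort_values(ascending=False)  # 30, 20, 10
by_label = s.sort_index()                       # labels a, b, c

print(by_value)
print(by_value_desc)
print(by_label)
```

In every case each value stays paired with its original label; only the row order changes.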