## Advanced DataFrames Practice

In [6]:
# imports
import numpy as np
import pandas as pd
from env import host, user, password, get_db_url

np.random.seed(123)

In [7]:
# Create list of values for names column.

students = ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada', 'John', 'Thomas',
            'Marie', 'Albert', 'Richard', 'Isaac', 'Alan']

# Randomly generate arrays of scores for each student for each subject.
# Note that all the values need to have the same length here.

math_grades = np.random.randint(low=60, high=100, size=len(students))
english_grades = np.random.randint(low=60, high=100, size=len(students))
reading_grades = np.random.randint(low=60, high=100, size=len(students))


In [8]:
# Construct the DataFrame using the above lists and arrays.

df = pd.DataFrame({'name': students,
                   'math': math_grades,
                   'english': english_grades,
                   'reading': reading_grades,
                   'classroom': np.random.choice(['A', 'B'], len(students))})


In [9]:
url = get_db_url('employees')
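
The `env` module imported at the top is not shown in this notebook. As a rough sketch only, a helper like `get_db_url` might build a SQLAlchemy-style connection string as below. The credentials, the MySQL/`pymysql` driver prefix, and the function signature are all assumptions for illustration, not the author's actual file.

```python
# Hypothetical stand-in for env.py; the real module is not shown here,
# so every value below is a placeholder.
host = 'localhost'
user = 'some_user'
password = 'some_password'


def get_db_url(db_name, user=user, host=host, password=password):
    """Build a SQLAlchemy-style connection URL for the named database."""
    # Assumes a MySQL server reached through the pymysql driver.
    return f'mysql+pymysql://{user}:{password}@{host}/{db_name}'


url = get_db_url('employees')
print(url)
```

Keeping credentials in a separate, untracked file like this keeps them out of the notebook itself.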

In [11]:
# query employees table from employees database
emp_query = '''

SELECT * 
FROM employees e;'''


In [12]:
employees = pd.read_sql(emp_query, url)
employees.head()

Unnamed: 0,emp_no,birth_date,first_name,last_name,gender,hire_date
0,10001,1953-09-02,Georgi,Facello,M,1986-06-26
1,10002,1964-06-02,Bezalel,Simmel,F,1985-11-21
2,10003,1959-12-03,Parto,Bamford,M,1986-08-28
3,10004,1954-05-01,Chirstian,Koblick,M,1986-12-01
4,10005,1955-01-21,Kyoichi,Maliniak,M,1989-09-12


In [13]:
tit_query = '''

SELECT * 
FROM titles'''

In [14]:
titles = pd.read_sql(tit_query, url)

In [15]:
titles.head()

Unnamed: 0,emp_no,title,from_date,to_date
0,10001,Senior Engineer,1986-06-26,9999-01-01
1,10002,Staff,1996-08-03,9999-01-01
2,10003,Senior Engineer,1995-12-03,9999-01-01
3,10004,Engineer,1986-12-01,1995-12-01
4,10004,Senior Engineer,1995-12-01,9999-01-01


In [17]:
employees.shape

(300024, 6)

In [18]:
titles.shape

(443308, 4)

In [19]:
# there are a lot more titles than employees. This makes sense if many
# employees have changed titles during their tenure
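
One way to check the "employees have changed titles" idea is to count title rows per employee. The sketch below uses only the five rows displayed by `titles.head()` (dates omitted), so the name `titles_sample` is an illustrative stand-in for the full table.

```python
import pandas as pd

# First five rows of the titles table as displayed above (dates omitted).
titles_sample = pd.DataFrame({
    'emp_no': [10001, 10002, 10003, 10004, 10004],
    'title': ['Senior Engineer', 'Staff', 'Senior Engineer',
              'Engineer', 'Senior Engineer'],
})

# Number of title rows recorded for each employee.
titles_per_employee = titles_sample.groupby('emp_no').size()

# Employees that appear more than once have held more than one title.
multiple_titles = titles_per_employee[titles_per_employee > 1]
print(multiple_titles)
```

Running the same `groupby` on the full `titles` DataFrame should show how many employees account for the extra rows.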

In [23]:
employees.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300024 entries, 0 to 300023
Data columns (total 6 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   emp_no      300024 non-null  int64 
 1   birth_date  300024 non-null  object
 2   first_name  300024 non-null  object
 3   last_name   300024 non-null  object
 4   gender      300024 non-null  object
 5   hire_date   300024 non-null  object
dtypes: int64(1), object(5)
memory usage: 13.7+ MB


In [24]:
titles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 443308 entries, 0 to 443307
Data columns (total 4 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   emp_no     443308 non-null  int64 
 1   title      443308 non-null  object
 2   from_date  443308 non-null  object
 3   to_date    443308 non-null  object
dtypes: int64(1), object(3)
memory usage: 13.5+ MB


In [26]:
# how many unique titles are there? 7
titles.title.nunique()

7

In [29]:
# the oldest date in the to_date column is 3/1/1985
# it appears that the protocol is to use 1/1/9999 as a proxy for 'presently' or
# some future time
titles.to_date.min(), titles.to_date.max()

(datetime.date(1985, 3, 1), datetime.date(9999, 1, 1))
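
If 9999-01-01 really is a placeholder for "presently", it can be turned into an explicit flag. This is a minimal sketch on the rows shown by `titles.head()`; the column name `is_current` and the constant `SENTINEL` are names introduced here for illustration.

```python
import datetime

import pandas as pd

# Rows copied from the titles.head() output shown earlier.
titles_sample = pd.DataFrame({
    'emp_no': [10001, 10002, 10003, 10004, 10004],
    'title': ['Senior Engineer', 'Staff', 'Senior Engineer',
              'Engineer', 'Senior Engineer'],
    'to_date': [datetime.date(9999, 1, 1), datetime.date(9999, 1, 1),
                datetime.date(9999, 1, 1), datetime.date(1995, 12, 1),
                datetime.date(9999, 1, 1)],
})

# The placeholder date observed as the maximum of to_date.
SENTINEL = datetime.date(9999, 1, 1)

# True where the title has no real end date yet.
titles_sample['is_current'] = titles_sample.to_date == SENTINEL

# Keep only the titles that are still held.
current_titles = titles_sample[titles_sample.is_current]
print(current_titles)
```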

#### Indexing and Subsetting
- Like the pandas Series object, the pandas DataFrame object supports both position- and label-based indexing using the indexing operator [].
- I will demonstrate concrete examples of indexing using the indexing operator [] alone and with the .loc and .iloc attributes below.


In [30]:
# Choose only two columns for my subset.

df[['name', 'classroom']]



Unnamed: 0,name,classroom
0,Sally,A
1,Jane,B
2,Suzie,A
3,Billy,B
4,Ada,A
5,John,B
6,Thomas,A
7,Marie,A
8,Albert,A
9,Richard,A


In [31]:
# can pass a boolean Series to the indexing operator as a selector
bools = df.name.str.startswith('A')
bools

0     False
1     False
2     False
3     False
4      True
5     False
6     False
7     False
8      True
9     False
10    False
11     True
Name: name, dtype: bool

In [32]:
df[bools]

Unnamed: 0,name,math,english,reading,classroom
4,Ada,77,92,98,A
8,Albert,92,62,87,A
11,Alan,92,62,72,A
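
Boolean Series can also be combined before being passed to the indexing operator. The sketch below uses a small frame named `grades`, built from the first five rows shown in this notebook's output, so that the values are fixed rather than dependent on the random seed.

```python
import pandas as pd

# First five students, with values copied from the output shown above.
grades = pd.DataFrame({
    'name': ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada'],
    'math': [62, 88, 94, 98, 77],
    'classroom': ['A', 'B', 'A', 'B', 'A'],
})

in_room_a = grades.classroom == 'A'
high_math = grades.math > 80

# & keeps rows where both masks are True.
both = grades[in_room_a & high_math]

# | keeps rows where at least one mask is True.
either = grades[in_room_a | high_math]

# ~ inverts a mask.
not_room_a = grades[~in_room_a]

print(both)
```

When the comparisons are written inline, each one needs its own parentheses, for example `grades[(grades.math > 80) & (grades.classroom == 'A')]`, because `&` binds more tightly than `>` and `==`.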


- We can use the `.loc` attribute to select specific rows AND columns by index label. The index label can be a number, but it can also be a string. Slices with `.loc` include both endpoints and use index labels, not positions.

 - this looks like `df.loc[row_indexer, column_indexer]` in general form
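
To make the label-based, inclusive behavior concrete, here is a small sketch that uses student names as string index labels. The frame `grades` holds the first five rows from this notebook's output; setting `name` as the index is a choice made for this illustration only.

```python
import pandas as pd

# First five students, indexed by name so the row labels are strings.
grades = pd.DataFrame({
    'name': ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada'],
    'math': [62, 88, 94, 98, 77],
    'english': [85, 79, 74, 96, 92],
    'reading': [80, 67, 95, 88, 98],
}).set_index('name')

# Both slice endpoints are included: rows Jane through Billy,
# columns math through english.
subset = grades.loc['Jane':'Billy', 'math':'english']
print(subset)

# A single row label and a single column label return one value.
ada_reading = grades.loc['Ada', 'reading']
print(ada_reading)
```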

In [34]:
# select all the rows and a subset of the columns. Note .loc is inclusive.
df.loc[:, 'math':'reading']

Unnamed: 0,math,english,reading
0,62,85,80
1,88,79,67
2,94,74,95
3,98,96,88
4,77,92,98
5,79,76,93
6,82,64,81
7,93,63,90
8,92,62,87
9,69,80,94


In [35]:
# I can use a boolean Series as a selector with .loc, too, but I can choose rows and columns.

df.loc[bools, 'name': 'reading']


Unnamed: 0,name,math,english,reading
4,Ada,77,92,98
8,Albert,92,62,87
11,Alan,92,62,72


- We can use the `.iloc` attribute to select specific rows and columns by index position. `.iloc` does not accept a boolean Series as a selector like `.loc` does. It takes integers representing index positions, and the end of a slice is NOT included.
-  basic syntax: `df.iloc[row_indexer, column_indexer]`
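
If a boolean mask is needed with `.iloc`, one workaround is to convert the Series to a plain NumPy array first. The sketch below assumes a small frame named `grades` built from the first five rows of this notebook's output, and compares the result with the equivalent `.loc` call.

```python
import pandas as pd

# First five students, with values copied from the output shown above.
grades = pd.DataFrame({
    'name': ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada'],
    'math': [62, 88, 94, 98, 77],
    'english': [85, 79, 74, 96, 92],
})

starts_with_s = grades.name.str.startswith('S')

# .iloc works with a boolean NumPy array plus integer column positions.
by_position = grades.iloc[starts_with_s.to_numpy(), [0, 1]]

# .loc takes the boolean Series directly plus column labels.
by_label = grades.loc[starts_with_s, ['name', 'math']]

print(by_position)
```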




In [36]:
# Notice the exclusive behavior of the indexing.

df.iloc[:3]


Unnamed: 0,name,math,english,reading,classroom
0,Sally,62,85,80,A
1,Jane,88,79,67,B
2,Suzie,94,74,95,A


In [38]:
# rows 0, 1, 2 and columns 1 and 2 (excluding 0, 3, and 4)
df.iloc[:3, 1:3]


Unnamed: 0,math,english
0,62,85
1,88,79
2,94,74


#### Aggregating
- The `.agg` method lets us specify a way to aggregate a series of numerical values. We pass an aggregate function or list of functions to the method that we want applied to a Series.



In [40]:
# can pass lists of columns to the indexer and a list of aggregation functions to .agg
df[['english', 'reading', 'math']].agg(['mean', 'min', 'max'])


Unnamed: 0,english,reading,math
mean,77.666667,86.5,84.833333
min,62.0,67.0,62.0
max,99.0,98.0,98.0
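
`.agg` also accepts a dictionary that maps column names to functions, which helps when each column needs a different aggregation. A minimal sketch, assuming a small frame `grades` built from the first five rows shown earlier:

```python
import pandas as pd

# First five students, with values copied from the output shown above.
grades = pd.DataFrame({
    'math': [62, 88, 94, 98, 77],
    'reading': [80, 67, 95, 88, 98],
})

# Highest math grade and lowest reading grade in one call.
summary = grades.agg({'math': 'max', 'reading': 'min'})
print(summary)
```

With one function per column the result is a Series indexed by column name.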


#### .groupby

The `.groupby()` method is used to create a grouped object, which we can then apply an aggregation on. For example, if we wanted to know the highest math grade from each classroom:

In [42]:
df.groupby('classroom').math.max()

classroom
A    94
B    98
Name: math, dtype: int64

- We can group by multiple columns as well. To demonstrate, we'll create a column named passing_math that labels each student as 'passing' or 'failing', then group by the combination of passing_math and classroom and calculate the average reading grade and the number of students in each subgroup.



`np.where()`
- We can create a column based on a condition using `np.where()`.
- general syntax: `np.where(condition, this_where_True, this_where_False)`


In [43]:
df['passing_math'] = np.where(df.math < 70, 'failing', 'passing')

In [44]:
df.head()

Unnamed: 0,name,math,english,reading,classroom,passing_math
0,Sally,62,85,80,A,failing
1,Jane,88,79,67,B,passing
2,Suzie,94,74,95,A,passing
3,Billy,98,96,88,B,passing
4,Ada,77,92,98,A,passing
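
`np.where()` calls can be nested when more than two labels are needed. The sketch below reuses the five math grades shown above; the second cutoff of 90 and the label 'excelling' are made up for illustration.

```python
import numpy as np

# Math grades for the first five students shown above.
math = np.array([62, 88, 94, 98, 77])

# Two outcomes: the same condition used in the notebook.
two_way = np.where(math < 70, 'failing', 'passing')

# Three outcomes: the value used when the first condition is False is
# itself another np.where call (the 90 cutoff is hypothetical).
three_way = np.where(math < 70, 'failing',
                     np.where(math < 90, 'passing', 'excelling'))

print(two_way)
print(three_way)
```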


In [50]:
grade_groups = df.drop(columns='name').groupby(['passing_math', 'classroom']).reading.agg(['mean', 'count'])
grade_groups

passing_math,classroom,mean,count
failing,A,87.0,2
passing,A,87.166667,6
passing,B,85.25,4


In [51]:
# I can even clean up my columns to make my calculations clearer.

grade_groups.columns = ['avg_reading_grade', 'count_of_students']
grade_groups


passing_math,classroom,avg_reading_grade,count_of_students
failing,A,87.0,2
passing,A,87.166667,6
passing,B,85.25,4
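
An alternative to renaming the columns afterwards is pandas' named aggregation, where each keyword passed to `.agg` becomes an output column. This sketch uses a small frame `grades` with the first five rows from the notebook output, so the numbers differ from the twelve-student results above.

```python
import pandas as pd

# First five students, with values copied from the output shown above.
grades = pd.DataFrame({
    'reading': [80, 67, 95, 88, 98],
    'classroom': ['A', 'B', 'A', 'B', 'A'],
    'passing_math': ['failing', 'passing', 'passing', 'passing', 'passing'],
})

# Each keyword names an output column; the tuple is (source column, function).
grade_groups = grades.groupby(['passing_math', 'classroom']).agg(
    avg_reading_grade=('reading', 'mean'),
    count_of_students=('reading', 'count'),
)
print(grade_groups)
```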


`.transform()`
- The `.transform` method produces a Series with the same length as the original DataFrame, where each value is the aggregate for the subgroup that row belongs to.
- This is great when we want to create a new column for the original df with aggregated group data for each individual record. 




In [52]:
df.assign(avg_math_score_by_classroom=df.groupby('classroom').math.transform('mean'))

Unnamed: 0,name,math,english,reading,classroom,passing_math,avg_math_score_by_classroom
0,Sally,62,85,80,A,failing,82.625
1,Jane,88,79,67,B,passing,89.25
2,Suzie,94,74,95,A,passing,82.625
3,Billy,98,96,88,B,passing,89.25
4,Ada,77,92,98,A,passing,82.625
5,John,79,76,93,B,passing,89.25
6,Thomas,82,64,81,A,passing,82.625
7,Marie,93,63,90,A,passing,82.625
8,Albert,92,62,87,A,passing,82.625
9,Richard,69,80,94,A,failing,82.625
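
Because the transformed Series lines up row for row with the original DataFrame, it can be used directly in arithmetic. As a sketch, the block below computes how far each student's math grade sits from the classroom average; the frame `grades` and the column name `math_vs_classroom` are illustrative.

```python
import pandas as pd

# First five students, with values copied from the output shown above.
grades = pd.DataFrame({
    'name': ['Sally', 'Jane', 'Suzie', 'Billy', 'Ada'],
    'math': [62, 88, 94, 98, 77],
    'classroom': ['A', 'B', 'A', 'B', 'A'],
})

# One value per row: the mean math grade of that row's classroom.
classroom_mean = grades.groupby('classroom').math.transform('mean')

# A plain aggregation, by contrast, returns one value per group.
per_group = grades.groupby('classroom').math.mean()

# Difference between each student's grade and the classroom mean.
grades['math_vs_classroom'] = grades.math - classroom_mean
print(grades)
```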


`.describe()`
- we can chain a .describe() onto a groupby to get summary statistics for the grouped data

In [54]:
df.groupby('classroom').reading.describe()

classroom,count,mean,std,min,25%,50%,75%,max
A,8.0,87.125,8.88719,72.0,80.75,88.5,94.25,98.0
B,4.0,85.25,12.392874,67.0,82.75,90.5,93.0,93.0


#### Merging and Joining
- Pandas provides several ways to combine dataframes together. We will look at two of them below:



`pd.concat()`
- This function takes in a list or dictionary of Series or DataFrame objects and joins them along a particular axis: row-wise (`axis=0`) or column-wise (`axis=1`).



- Default is set to row-wise concatenation using an outer join.

`pd.concat(objs, axis=0, join='outer')`


- When concatenating dataframes vertically, we basically are just adding more rows to an existing dataframe. In this case, the dataframes we are putting together should have the same column names.



In [55]:
df1 = pd.DataFrame({'a': [1, 2, 3]})
df2 = pd.DataFrame({'a': [4, 5, 6]})

df1


Unnamed: 0,a
0,1
1,2
2,3


In [56]:
df2

Unnamed: 0,a
0,4
1,5
2,6


In [57]:
pd.concat([df1, df2])

Unnamed: 0,a
0,1
1,2
2,3
0,4
1,5
2,6


Note that the indices are preserved on the resulting dataframe; we could set the ignore_index parameter to True if we wanted these to be sequential.



In [58]:
concat_df1 = pd.concat([df1, df2], ignore_index=True)
concat_df1

Unnamed: 0,a
0,1
1,2
2,3
3,4
4,5
5,6


In [59]:
concat_df2 = pd.DataFrame({'b': [1, 2, 3, 4, 5, 6]})
concat_df2


Unnamed: 0,b
0,1
1,2
2,3
3,4
4,5
5,6


In [60]:
pd.concat([concat_df1, concat_df2], axis=1)


Unnamed: 0,a,b
0,1,1
1,2,2
2,3,3
3,4,4
4,5,5
5,6,6


`.merge()`
- This method is similar to a SQL join. Here's a [cool read](https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html#compare-with-sql-join) making a comparison between the two, if you're interested.

- The `how` keyword argument defines the type of join to perform; `inner` is the default.

`# df.merge default settings for commonly used parameters.`

`left_df.merge(right_df, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, indicator=False)`

How does changing the default argument of the how parameter change my resulting DataFrame?

`how`: the type of merge to be performed.

`how='left'`: use only keys from the left frame, similar to a SQL left outer join; preserve key order.

`how='right'`: use only keys from the right frame, similar to a SQL right outer join; preserve key order.

`how='outer'`: use the union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.

`how='inner'`: use the intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.
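
A quick way to see the effect of each option is to compare row counts on two tiny frames. The frames `left` and `right` below are invented for this sketch; keys 2 and 3 appear in both, key 1 only on the left, and key 4 only on the right.

```python
import pandas as pd

# Hypothetical frames with partially overlapping keys.
left = pd.DataFrame({'key': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

# Number of rows produced by each type of merge.
row_counts = {
    how: len(left.merge(right, on='key', how=how))
    for how in ['inner', 'left', 'right', 'outer']
}
print(row_counts)

# The outer merge keeps every key from either frame.
outer_keys = sorted(left.merge(right, on='key', how='outer').key)
print(outer_keys)
```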



In [62]:
# Create the users DataFrame.

users = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'name': ['bob', 'joe', 'sally', 'adam', 'jane', 'mike'],
    'role_id': [1, 2, 3, 3, np.nan, np.nan]
})
users


Unnamed: 0,id,name,role_id
0,1,bob,1.0
1,2,joe,2.0
2,3,sally,3.0
3,4,adam,3.0
4,5,jane,
5,6,mike,


In [63]:
# Create the roles DataFrame

roles = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['admin', 'author', 'reviewer', 'commenter']
})
roles


Unnamed: 0,id,name
0,1,admin
1,2,author
2,3,reviewer
3,4,commenter


`.merge()` will allow us to specify left_on and right_on to indicate the columns that are the keys used to merge the dataframes together.

- For demonstration purposes, we set the `indicator` parameter to `True`, which adds a `_merge` column showing whether each merge key appears in the left DataFrame only (`left_only`), the right DataFrame only (`right_only`), or `both`.

In [64]:
# Perform an outer join specifying the left and right DataFrame keys.

users.merge(roles, left_on='role_id', right_on='id', how='outer', indicator=True)


Unnamed: 0,id_x,name_x,role_id,id_y,name_y,_merge
0,1.0,bob,1.0,1.0,admin,both
1,2.0,joe,2.0,2.0,author,both
2,3.0,sally,3.0,3.0,reviewer,both
3,4.0,adam,3.0,3.0,reviewer,both
4,5.0,jane,,,,left_only
5,6.0,mike,,,,left_only
6,,,,4.0,commenter,right_only


- Notice that we have duplicate column names in the resulting dataframe. By default, pandas will add a suffix of _x to any columns in the left dataframe that are duplicated, and _y to any columns in the right dataframe that are duplicated. I can clean up my columns if I want to; one way would be to use method chaining, which is demonstrated below:



In [65]:
(users.merge(roles, 
            left_on='role_id', 
            right_on='id', 
            how='outer')
    .drop(columns='role_id')
    .rename(columns={'id_x': 'id', 
                     'name_x': 'employee',
                     'id_y': 'role_id',
                     'name_y': 'role'}
            )
)


Unnamed: 0,id,employee,role_id,role
0,1.0,bob,1.0,admin
1,2.0,joe,2.0,author
2,3.0,sally,3.0,reviewer
3,4.0,adam,3.0,reviewer
4,5.0,jane,,
5,6.0,mike,,
6,,,4.0,commenter
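
Another option is the `suffixes` parameter of `.merge()`, which replaces the default `_x` and `_y` with more descriptive endings. This sketch rebuilds the `users` and `roles` frames from above; the suffix strings `_user` and `_role` and the choice of a left join are illustrative.

```python
import numpy as np
import pandas as pd

# The users and roles DataFrames from the examples above.
users = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6],
    'name': ['bob', 'joe', 'sally', 'adam', 'jane', 'mike'],
    'role_id': [1, 2, 3, 3, np.nan, np.nan],
})
roles = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'name': ['admin', 'author', 'reviewer', 'commenter'],
})

# Overlapping column names get '_user' (left) and '_role' (right).
merged = users.merge(roles,
                     left_on='role_id',
                     right_on='id',
                     how='left',
                     suffixes=('_user', '_role'))
print(merged)
```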


## Exercises II

1. Copy the users and roles DataFrames from the examples above.



2. What is the result of using a right join on the DataFrames?

