## Week 10 Class activities
This notebook is a starting point for the exercises and activities that we'll do in class. 

We'll revisit our analysis of ADUs, and try to speed things up and reduce memory usage.

Then, we'll have some practice with SQL.

Before you attempt any of these activities, make sure to watch the video lectures for this week.

### Optimizing data types and parsing csv files
Let's revisit the ADU data that we saw in Lecture 12 (classification). There are two CSV files that we read in: one with the permit data, and one with the parcel data.

In [1]:
import pandas as pd
permits = pd.read_csv('../lectures/data/ADU_permits.csv')  # this file should be in your GitHub folder
permits.info() 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15741 entries, 0 to 15740
Data columns (total 4 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Assessor Book                  15741 non-null  float64
 1   Assessor Page                  15741 non-null  float64
 2   Assessor Parcel                15741 non-null  object 
 3   # of Accessory Dwelling Units  15741 non-null  float64
dtypes: float64(3), object(1)
memory usage: 492.0+ KB


In [2]:
parcels = pd.read_csv('../lectures/data/parcels.csv')
parcels.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 789182 entries, 0 to 789181
Data columns (total 14 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   APN                789182 non-null  object 
 1   UseType            789140 non-null  object 
 2   UseDescription     789140 non-null  object 
 3   YearBuilt1         778940 non-null  float64
 4   Units1             778940 non-null  float64
 5   Bedrooms1          778940 non-null  float64
 6   Bathrooms1         778940 non-null  float64
 7   SQFTmain1          778940 non-null  float64
 8   Roll_LandValue     782704 non-null  float64
 9   Roll_ImpValue      782704 non-null  float64
 10  Roll_LandBaseYear  789182 non-null  int64  
 11  Roll_ImpBaseYear   789182 non-null  int64  
 12  CENTER_LAT         789182 non-null  float64
 13  CENTER_LON         789182 non-null  float64
dtypes: float64(9), int64(2), object(3)
memory usage: 84.3+ MB


For me, the permits data takes 492 KB of memory. The parcels data uses 84 MB. Imagine that you had a much bigger data file (perhaps more counties, or perhaps more columns). How can you reduce the memory requirements of these dataframes?

*Hint*: I didn't mention this in the lecture, but pandas has a new "nullable integer" datatype which can take missing values (NaNs).

To take advantage of this, use `Int16`, `Int32`, etc. rather than `int16`, `int32`, etc. The capital I is the nullable integer datatype; the lower-case i is the numpy datatype. [See here for a detailed explanation.](https://pandas.pydata.org/docs/user_guide/integer_na.html)
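
As a rough illustration of the difference, here is a minimal sketch on a made-up three-row column (the values are hypothetical, not taken from the parcels file). It shows that `Int16` keeps the missing value while the NumPy `int16` conversion fails.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (hypothetical data)
s = pd.Series([1995.0, 2004.0, np.nan])
print(s.dtype)  # the NaN forces a float column

# Capital I: pandas nullable integer, the NaN becomes <NA>
nullable = s.astype("Int16")
print(nullable.dtype)
print(nullable.isna().sum())

# Lower-case i: the NumPy dtype cannot hold a missing value
try:
    s.astype("int16")
except (ValueError, TypeError) as err:
    print("int16 conversion failed:", err)

# Compare the bytes used by the values themselves (index excluded)
float_bytes = s.memory_usage(index=False)
nullable_bytes = nullable.memory_usage(index=False)
print(float_bytes, nullable_bytes)
```

The nullable column stores a small mask alongside the integers, so it is slightly larger than a plain `int16` column would be, but still much smaller than `float64`.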

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Reduce the memory usage of each dataframe as much as you can.
</div>

In [None]:
# your code here

Now, imagine that the dataframes wouldn't even load into memory in the first place.

Also imagine that you don't need the `Bathrooms1` column.

Use additional arguments to `pd.read_csv()` and the datatypes you used in the answer above to load in the dataframes again.
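
One way this could look is sketched below on a tiny in-memory CSV (the rows are hypothetical stand-ins, and the dtype choices are only an example, not the intended answer). `usecols` skips unwanted columns at parse time and `dtype` sets the types as the file is read.

```python
import io
import pandas as pd

# Tiny stand-in for parcels.csv (hypothetical values)
csv_text = """APN,UseType,YearBuilt1,Bathrooms1
2004-004-010,Residential,1952,2
2004-004-011,Residential,,1
2004-006-011,Commercial,1987,
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["APN", "UseType", "YearBuilt1"],  # Bathrooms1 is never loaded
    dtype={"UseType": "category", "YearBuilt1": "Int16"},  # set types while parsing
)
print(df.dtypes)
print(df)
```

Because the types are applied during parsing, pandas never has to build the larger `float64` and `object` columns first.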

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> Load the dataframes from disk in the most memory-efficient way possible.
</div>

In [None]:
# your code here

### Speeding up joins

Now let's try to join them together in an efficient way.

Remember `%timeit` (for a line of code) and `%%timeit` (for a cell) will tell you how long your code takes to run.
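
`%timeit` only works inside IPython or Jupyter. If you ever need a similar measurement in a plain Python script, the standard-library `timeit` module is one option. The sketch below times a join on synthetic dataframes (the data and the names `left` and `right` are made up for illustration).

```python
import timeit
import numpy as np
import pandas as pd

# Synthetic frames (hypothetical data) that share an index called "key"
rng = np.random.default_rng(0)
left = pd.DataFrame({"value": rng.random(10_000)},
                    index=pd.RangeIndex(10_000, name="key"))
right = pd.DataFrame({"flag": rng.integers(0, 2, 1_000)},
                     index=pd.RangeIndex(1_000, name="key"))

# Run the join 10 times per measurement, and take 3 measurements
times = timeit.repeat(lambda: left.join(right, how="left"), number=10, repeat=3)

# Report the fastest measurement, converted to milliseconds per join
best_ms = min(times) / 10 * 1000
print(f"best of 3: {best_ms:.2f} ms per join")
```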

Before we join, we need to drop the rows with no parcel number, and create a concatenated column for the APN.

In [3]:
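# drop rows with no parcel number; .copy() avoids a SettingWithCopyWarning when we add a column below
permits = permits[permits['Assessor Parcel']!='***'].copy()
# build the APN as book-page-parcel, zero-padded to 4, 3 and 3 digits
permits['APN'] = (permits['Assessor Book'].astype(int).astype(str).str.zfill(4) + '-' 
                   + permits['Assessor Page'].astype(int).astype(str).str.zfill(3) + '-'
                   + permits['Assessor Parcel'].astype(int).astype(str).str.zfill(3))
# drop the duplicates (keep the first non-null value in each column for each APN)
permits = permits.groupby('APN').first()
parcels = parcels.groupby('APN').first()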

permits.head()

| APN | Assessor Book | Assessor Page | Assessor Parcel | # of Accessory Dwelling Units |
|---|---|---|---|---|
| 2004-004-010 | 2004.0 | 4.0 | 10 | 1.0 |
| 2004-004-011 | 2004.0 | 4.0 | 11 | 1.0 |
| 2004-006-011 | 2004.0 | 6.0 | 11 | 1.0 |
| 2004-009-007 | 2004.0 | 9.0 | 7 | 1.0 |
| 2004-010-020 | 2004.0 | 10.0 | 20 | 1.0 |


This is how we joined the parcels in lecture.

In [4]:
joinedDf = parcels.join(permits, how='left') # left is the default so we could omit that argument

How long does that take?

In [5]:
%timeit joinedDf = parcels.join(permits, how='left')

57.5 ms ± 154 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


What if we use `pd.merge()` on a column, rather than the index? 

Let's first reset the indexes, so that `APN` becomes a regular column.

In [None]:
parcels.reset_index(inplace=True)
permits.reset_index(inplace=True)

<div class="alert alert-block alert-info">
<strong>Exercise:</strong> How long does it take to merge using the columns?</div>

In [None]:
# your code here

If you are curious, index lookups tend to be fast because pandas index labels are [stored as hashes](https://stackoverflow.com/questions/27238066/what-is-the-point-of-indexing-in-pandas).

Before we move on, let's create the `has_adu` column as we did before.

In [None]:
joinedDf['has_adu'] = joinedDf['# of Accessory Dwelling Units']>=1

## SQL

Finally, let's do some simple SQL queries on the joined dataframe.

Implement the following queries as SQL, using `pandasql`.

1. What's the total land value of all the parcels? (The column `Roll_LandValue`.)

2. How many parcels have ADUs? 

3. What's the average building size (`SQFTmain1`), with and without an ADU? (Hint: use `GROUP BY`.)

4. Same as above, but for residential parcels only (`UseType='Residential'`).

Compare your results to a typical `pandas` query.
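
To show what such a comparison might look like, here is a minimal sketch on a toy dataframe (hypothetical values). Note that it uses the standard-library `sqlite3` module with `pd.read_sql_query` rather than `pandasql`, so that it runs without extra packages. `pandasql` also uses SQLite by default, so the same SQL syntax should carry over to `sqldf`.

```python
import sqlite3
import pandas as pd

# Toy stand-in for joinedDf (hypothetical values)
toy = pd.DataFrame({
    "SQFTmain1": [1000.0, 2000.0, 1500.0, 2500.0],
    "has_adu": [0, 0, 1, 1],
})

# Copy the dataframe into an in-memory SQLite database
conn = sqlite3.connect(":memory:")
toy.to_sql("toy", conn, index=False)

# SQL version: mean building size for each value of has_adu
sql_result = pd.read_sql_query(
    "SELECT has_adu, AVG(SQFTmain1) AS mean_sqft "
    "FROM toy GROUP BY has_adu ORDER BY has_adu",
    conn,
)
conn.close()

# pandas version of the same aggregation
pandas_result = toy.groupby("has_adu")["SQFTmain1"].mean()

print(sql_result)
print(pandas_result)
```

Both versions should report the same two group means, which is a handy sanity check when you write the real queries.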

In [None]:
from pandasql import sqldf

# your code here

<div class="alert alert-block alert-info">
<h3>You should now be able to:</h3>
<ul>
  <li>Deal with larger datasets on a standard laptop computer</li>
  <li>Profile your code to identify bottlenecks</li>
  <li>Write simple SQL queries</li>
  <li>Go forth and do data science!</li>
</ul>
</div>