# Lab 2 - Basic data cleaning and analysis
*© 2022 Colin Conrad*

Welcome to Week 2 of ECMM 6014! Last week we explored two key elements of the Python programming language: loops and functions. These skills are the essential building blocks of virtually all work in data science. This week we will explore two basic data structures: lists and dictionaries. This week's case is built on rental data for Halifax from the Canada Mortgage and Housing Corporation (CMHC) and may actually provide you with useful insight in your search for housing.

**This week, we will achieve the following objectives:**
- Make and manage lists
- Use a function to interpret list data
- Assess list data quality
- Make and manage a dictionary
- Analyze where rent is growing fastest

Reference: Sweigart (2014) Ch. 4 and 5. [Sweigart (2020)](https://automatetheboringstuff.com/).

# Case: Canada Mortgage and Housing Corporation
The [Canada Mortgage and Housing Corporation (CMHC)](https://www.youtube.com/watch?v=vy19rwKFGYk#action=share) is a Crown corporation with a mandate to support housing for Canadians. Founded shortly after the Second World War, its original purpose was to find housing for veterans. Now it provides programs for making mortgages more affordable and enforces policies designed to make rent more accessible. The CMHC also provides data on rental affordability, including [information on the average price of rent](https://www03.cmhc-schl.gc.ca/hmip-pimh/en#TableMapChart/0580/3/Halifax%20CMA) for various regions in Halifax through its information portal.

Like most information portals, CMHC's is built to make the most important information accessible. This has the unfortunate consequence of also making it difficult to retrieve the data that you need. To retrieve rental price data, you must first select your appropriate data boundary and then select `Primary Rental Market` and `Average Rent ($)`. You can then download data in csv format, similar to the `2021_10_Halifax_Rental.csv` provided to you in Brightspace.

The data provided by CMHC is not formatted in a way that is conducive to data science. They provide data in rows and columns, complete with notes at the bottom of the file. It would be desirable to prepare the data in a way that is appropriate for analysis. If we were able to do it in Python, we could sort through hundreds of CSV spreadsheets, such as the rental data for October of each preceding year.

# Objective 1: Make and manage lists
Lists are very important for data scientists. In fact, this is the week when our Python tasks will start to feel a bit more like data science! Lists are data structures which consist of a series of values stored in a fixed order. For example, if we wanted to list the rental values for the South End of Halifax, as given in the 2021 CMHC rental data, we could create a list such as `peninsula_south` below.

In [1]:
# a list of rental prices

peninsula_south = [965, 1313, 1751, 2183] # corresponds to line 4 of the 2021 data

In some ways, lists are like the strings which we investigated previously. Lists are organized with indexes, similarly to the sequence of characters in strings, and the index starts counting at 0. For example, if we wanted to retrieve the second item in our list, we could retrieve it using the index `1`.

In [2]:
peninsula_south[1] # retrieves the second value from the list

1313

Lists can also be subdivided using an index range, similarly to strings. For example, if we had the string `"The data scientist solved the problem"` and wanted to only retrieve the values `"The"`, we could specify the range `[0:3]`. Likewise, to retrieve values from a list, we can specify the range of items, such as with the code below.

In [3]:
peninsula_south[0:3] # returns a subdivided list of only the first three values

[965, 1313, 1751]
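
As a small aside, a slice can also leave out its start or its end, and a negative index counts backwards from the end of the list. The sketch below re-creates the rental values under the assumed name `rents` so that it runs on its own; the other variable names are also chosen just for illustration.

```python
# a standalone copy of the peninsula_south values (the name `rents` is only for this sketch)
rents = [965, 1313, 1751, 2183]

first_two = rents[:2]   # omitting the start means "from the beginning"
last_two = rents[2:]    # omitting the end means "through the final item"
last_item = rents[-1]   # a negative index counts backwards from the end

print(first_two)
print(last_two)
print(last_item)
```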

That said, lists can contain any type of data, such as strings, integers, floats ... even other lists! In this respect, they are very different from strings, which can only contain characters. The code below demonstrates how lists can contain different types of values. Try printing the various list values by executing the cell.

In [4]:
list_example = ["Strings are cool!", 42, 42.56, ["Seriously, they are cool!", "You can even add lists within lists!"]]

for l in list_example:
    print(l)

Strings are cool!
42
42.56
['Seriously, they are cool!', 'You can even add lists within lists!']
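
To reach a value inside the nested list, you can chain two indexes: the first picks the inner list and the second picks an item within it. This is a minimal sketch that re-creates `list_example` so it runs on its own; `inner_list` and `inner_message` are names assumed for illustration.

```python
list_example = ["Strings are cool!", 42, 42.56,
                ["Seriously, they are cool!", "You can even add lists within lists!"]]

inner_list = list_example[3]        # the fourth item is itself a list
inner_message = list_example[3][1]  # the second item of that inner list

print(inner_list)
print(inner_message)
```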


### Looping through lists

In practical terms, the combination of loops and lists is very powerful. In data science, we could store a series of values in a list and use loops to iterate through each item in the list. The `for` loop is particularly well suited for this task. Consider the code below, which iterates through each `r` (short notation for rental value) in `peninsula_south`, checks whether it is greater than 1200, and if so, prints it.

In [5]:
for r in peninsula_south: # iterates through each list value
    if r > 1200:
        print(r)

1313
1751
2183


This creates some interesting dynamics for your code. As we saw last week, `while` loops will keep executing the code in the loop for as long as the logical condition remains true. If we wanted, we could write a `while` loop that iterates over a range of prespecified index values. Consider the following code, which accomplishes the same result as the `for` loop above.

In [6]:
i = 0

# iterates through the first four values
while i < 4:
    if peninsula_south[i] > 1200:
        print(peninsula_south[i])
    i += 1

1313
1751
2183


There are some situations where you may wish to use a `while` loop instead of a `for` loop, such as when you wish to select specific list items (e.g. only the values at indexes 1 and 3). However, this also comes with a disadvantage: if you iterate through index values that do not exist, your loop will crash! The following loop attempts to iterate through five values, though only four exist. The result is a crash.

In [7]:

i = 0

# tries to iterate through the first five values. There are only four values, so it crashes.
while i < 5:
    if peninsula_south[i] > 1200:
        print(peninsula_south[i])
    i += 1

1313
1751
2183


IndexError: list index out of range

Fortunately, we have the handy `len()` function which can be used to discover the length of a list. Execute the following cell to retrieve the length of `peninsula_south`.

In [8]:
len(peninsula_south) # returns the number of items in the list

4

When combined with `while` loops, the `len()` function allows us to stop exactly at the end of the list. It is a best practice to use this function when creating `while` loops to prevent crashes.

In [9]:
i = 0

# returns all of the values between i and the final list item

while i < len(peninsula_south):
    print(peninsula_south[i])
    i += 1

965
1313
1751
2183
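
If you need both the position and the value, one possible alternative to a `while` loop is Python's built-in `enumerate()`, which hands you the index and the item together and cannot run past the end of the list. This is a sketch rather than part of the lab; the list is re-created so the cell runs on its own, and `labelled` is a name assumed for illustration.

```python
peninsula_south = [965, 1313, 1751, 2183]

labelled = []  # collects (index, rent) pairs so we can inspect them afterwards
for i, rent in enumerate(peninsula_south):  # i is the index, rent is the value
    labelled.append((i, rent))
    print(i, rent)
```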


### *Challenge Question 1*
Many university students in co-op positions earn 20 per hour, which translates to approximately 3200 per month (gross). Financial advisors often recommend that people spend no more than 30\% of their gross income on housing, which in the case of co-op students, would translate to 1060 per month. It could be desirable to know which of the living options are below this threshold.

Create a new list called `peninsula_north` which contains the values for the Bachelor, 1 Bedroom, 2 Bedroom and 3 Bedroom rental prices from the 2021 spreadsheet. Create a loop which checks each of the values and prints the value if it is less than 1060.

In [10]:
peninsula_north = [916,1104,1333,1536] # a list of rental prices for peninsula_north

for r in peninsula_north:
    if r < 1060: # strictly less than the threshold, as the question asks
        print(r)

916


# Objective 2: Use a function to interpret list data
Last week we started to use functions to interpret data. It is probably no surprise to you that functions are also extremely useful for managing structured data, such as that contained in lists. Though there are many possibilities, we will explore two ways that functions are typically used with respect to list data.

### ...functions can be used when iterating through lists
Functions are often used to simplify your code so that you don't have to retype the code over and over. You can create functions that are used for iterating through lists and performing some sort of logic on each list item. For example, check out `checkRent()` below, which checks a rent value that is given as input and determines whether the rent is affordable.

In [11]:
def checkRent(rent): # takes a rent value as input
    if rent < 1200: # checks whether the value is less than 1200
        print('This is affordable.')
    else:
        print('Not affordable.')

We could use this function to create a simple while loop which checks the rent for each value in the list. The following code does this on the `peninsula_north` list.

In [12]:
i = 0

while i < len(peninsula_north): # iterate through each list value
    checkRent(peninsula_north[i]) # execute the checkRent function
    i += 1

This is affordable.
This is affordable.
Not affordable.
Not affordable.


Alternatively, we could further simplify this using a `for` loop. Consider the following code which does the same thing on `peninsula_south`. Consider changing it to check if it also works with the north!

In [13]:
for r in peninsula_south: # loop through the values in peninsula_south
    checkRent(r) # execute the function

This is affordable.
Not affordable.
Not affordable.
Not affordable.


### ... functions can be used to process lists
Additionally, functions can take whole lists as input and apply logic to them! For example, it could be desirable to determine whether the values of many lists meet our requirements. By taking a list as input, we could easily process both `peninsula_north` and `peninsula_south`.

The `explainRent()` function below takes a `region_list` as an input and uses a loop to determine whether the values in the list are affordable.

In [14]:
def explainRent(region_list): # takes a list as input
    i = 0
    while i < len(region_list): # iterate through each value in the list until it reaches the length of the list
        if i == 0:
            print("A Bachelor apartment in this region costs " + str(region_list[i]) + " on average")
        else:
            print("A " + str(i) + " bedroom apartment in this region costs " + str(region_list[i]) + " on average")
        i += 1

We can now use the function to crunch through the `peninsula_south` list in one line! Consider modifying the code to try it for the North End as well.

In [15]:
explainRent(peninsula_south) # execute the function

A Bachelor apartment in this region costs 965 on average
A 1 bedroom apartment in this region costs 1313 on average
A 2 bedroom apartment in this region costs 1751 on average
A 3 bedroom apartment in this region costs 2183 on average
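
A possible variation, shown only as a sketch: instead of printing inside the function, you could build the sentences and `return` them, which lets you reuse the results later (for example, to save them to a file). The function name `describe_rent` is hypothetical and is not part of the lab; the North End values are typed in directly so the cell runs on its own.

```python
def describe_rent(region_list):  # hypothetical variant of explainRent
    """Return one sentence per apartment type instead of printing."""
    sentences = []
    for bedrooms, rent in enumerate(region_list):
        if bedrooms == 0:
            label = "A Bachelor apartment"
        else:
            label = f"A {bedrooms} bedroom apartment"
        sentences.append(f"{label} in this region costs {rent} on average")
    return sentences

# the caller decides what to do with the returned sentences
for sentence in describe_rent([916, 1104, 1333, 1536]):
    print(sentence)
```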


Finally, functions can take multiple inputs, in addition to lists. For example, rather than predefining the value that we would like to check, we could take the value as an input, giving us more flexibility. The `assessRent()` function below similarly takes a `region_list` but also takes in a `threshold` against which each rent is compared. This is very handy for calculating whether rent is affordable for various income levels!

In [16]:
def assessRent(region_list, threshold): # takes two inputs
    i = 0
    while i < len(region_list): # same as before, keeps running while i is less than the length of the region list
        if region_list[i] < threshold: # checks whether the value is less than the threshold
            if i == 0:
                print ("A Bachelor apartment in this region costs " + str(region_list[i]) + " on average, which is affordable")
            else:
                print("A " + str(i) + " bedroom apartment in this region costs " + str(region_list[i]) + " on average, which is affordable")
        i += 1

Try executing the function below with different thresholds (e.g. 1200). You will see how the output changes.

In [17]:
assessRent(peninsula_north, 1200)

A Bachelor apartment in this region costs 916 on average, which is affordable
A 1 bedroom apartment in this region costs 1104 on average, which is affordable


### *Challenge Question 2*
The `assessRent()` function above is very handy. The only limitation is that it produces a generic response. Modify the code to do the following: 

* Take a third input called `region_name`, which is a string
* Add the string contained in `region_name` to your print statement to give a context-relevant response
* Two test scenarios are provided to test your code

In [19]:
def assessRent(region_list, threshold, region_name): 
    i = 0
    while i < len(region_list): 
        if region_list[i] < threshold: 
            if i == 0:
                print ("A Bachelor apartment in "+ region_name + " costs " + str(region_list[i]) + " on average, which is affordable")
            else:
                print("A " + str(i) + " bedroom apartment in " + region_name + " costs " + str(region_list[i]) + " on average, which is affordable")
        i += 1

#### Sample Test 1
Should return `"A Bachelor apartment in Peninsula North costs 916 on average, which is affordable"`

In [20]:
assessRent(peninsula_north,1100,"Peninsula North")

A Bachelor apartment in Peninsula North costs 916 on average, which is affordable


#### Sample Test 2
Should return `"A Bachelor apartment in peninsula_south costs 965 on average, which is affordable"`

In [21]:
assessRent(peninsula_south, 1300, "peninsula_south")

A Bachelor apartment in peninsula_south costs 965 on average, which is affordable


# Objective 3: Assess list data quality
So far, we have used relatively clean data in our analysis. However, the majority of data that you will encounter starting next week will be... less than ideal. If we inspect the `Halifax_Rental.csv` files, it becomes clear that each value in the CSV rows is actually followed by a data quality character. Our data should actually look more like the list provided in `mainland_south`.

In [22]:
mainland_south = [834,"c",835,"a",1131,"b",1270,"b"] #corresponds to line 6 of the 2021 rental data

Fortunately, list data types come with a few handy methods for solving data cleaning problems. Sweigart's Chapter 4 goes into a lot of detail (perhaps *too* much detail this time) about handy list methods, which you should read through. We will highlight a few of them here however.

The `index` method will tell you the index of the *first* instance of a specified value in a list. This can be very handy for figuring out where the value sits in the list so that we can modify or remove it.

In [23]:
mainland_south.index("a")

3
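
One caveat worth knowing: `index` raises a `ValueError` if the value is not in the list at all. A minimal sketch of one way to guard against that is to check with the `in` operator first. The list is re-created here so the cell runs on its own, and the letter `"d"` is simply an assumed example of a value that is absent.

```python
mainland_south = [834, "c", 835, "a", 1131, "b", 1270, "b"]

for flag in ["a", "d"]:  # "d" does not appear in the list
    if flag in mainland_south:  # check membership before calling index
        print(flag, "first appears at index", mainland_south.index(flag))
    else:
        print(flag, "is not in the list")
```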

`del` is a statement that removes list values. Now that we know that `mainland_south[3]` is the quality character `"a"`, we can remove it using `del`, as in the following cell.

In [24]:
del mainland_south[3]

mainland_south

[834, 'c', 835, 1131, 'b', 1270, 'b']

Alternatively, we could use the `remove` method to clean out particular values. The code below removes the first instance of the letter `b`. However, it only removes one instance per call, so we would have to run it twice to remove both `b` values; a third run would raise a `ValueError` because no `b` would remain. The output below shows the list after a single run. There are more efficient ways to do this, though that may be a story for another day.

In [25]:
# each run removes one "b"; run me twice to clear both
mainland_south.remove("b")
mainland_south

[834, 'c', 835, 1131, 1270, 'b']

Finally, it is helpful to insert values into a list. The most common method used for this task is `append`, which adds the value to the end of the list. The following code appends the total average value for `Mainland South` to the list. Pretty handy!

In [26]:
mainland_south.append(779)
mainland_south

[834, 'c', 835, 1131, 1270, 'b', 779]

Alternatively, the `insert` method can be used to add a value at a specific position in the list. You can read more about it in Sweigart Chapter 4.
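
One way to strip every letter code in a single pass is a list comprehension combined with `isinstance()`. The sketch below is an assumption about what a more efficient approach might look like, not the lab's own method; `raw_row` and `cleaned_row` are names chosen for illustration, and the row is re-created so that earlier cells do not affect it.

```python
raw_row = [834, "c", 835, "a", 1131, "b", 1270, "b"]  # same values as the original mainland_south

# keep only the items that are integers, dropping the letter codes
cleaned_row = [value for value in raw_row if isinstance(value, int)]

print(cleaned_row)
```

Note that this builds a new list and leaves `raw_row` untouched, which makes it easy to compare the data before and after cleaning.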

### *Challenge Question 3 (2 points)*
The `sackville` list below contains a value of `"**"`. In the cell below, create code that changes this value to the string `"Insufficient data"`. _Note: Though you could simply re-write this list, such answers will not be accepted. Your code must remove, insert or change the value in the list below!_ 

In [27]:
sackville = ["**", 931, 1098, 1288, 1101]

sackville.remove("**") # remove the placeholder value
sackville.insert(0, 'Insufficient data') # insert the replacement at the start of the list

#### Sample Test 1
Should return `['Insufficient data', 931, 1098, 1288, 1101]`.

In [28]:
sackville

['Insufficient data', 931, 1098, 1288, 1101]

# Objective 4: Make and manage a dictionary
In addition to lists, there is a second data structure that is commonly used in Python: dictionaries. Unlike lists which use the list order to determine the sequence of values, dictionaries use a key-value pair structure. There are no "first" items in a dictionary; instead, all of the values stored are mapped with keys.

Sometimes it is better to simply see things in action. The data for `peninsula_south` have been re-written into a dictionary, this time with keys (e.g. `Bach`, `1Bdr`) mapping to their respective values (e.g. `965`, `1313`). Try running the code below to set up your dictionary.

In [29]:
peninsula_south = {'Bach': 965, '1Bdr': 1313, '2Bdr': 1751, '3Bdr': 2183}

To retrieve a dictionary value, we simply need to specify the key that we are looking for! The line below retrieves the value for bachelor apartments. Consider modifying it to retrieve 1 bedrooms.

In [30]:
peninsula_south['Bach']

965
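
One thing to watch for: asking for a key that does not exist raises a `KeyError`. Below is a minimal sketch of two ways to guard against this, using the `in` operator and the dictionary `get` method. The `'4Bdr'` key and the `'No data'` fallback are made up for illustration.

```python
peninsula_south = {'Bach': 965, '1Bdr': 1313, '2Bdr': 1751, '3Bdr': 2183}

has_four_bedroom = '4Bdr' in peninsula_south           # membership test on the keys
four_bedroom = peninsula_south.get('4Bdr', 'No data')  # returns the fallback instead of crashing

print(has_four_bedroom)
print(four_bedroom)
```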

In [31]:
keys = ['Bach','1Bdr','2Bdr','3Bdr']

for i in keys:
    print(peninsula_south[i])

965
1313
1751
2183
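
Rather than keeping a separate list of keys, a dictionary can also hand you each key together with its value through the `items()` method. This is a small sketch; the dictionary is re-created so the cell runs on its own, and `over_1200` is a name assumed for illustration.

```python
peninsula_south = {'Bach': 965, '1Bdr': 1313, '2Bdr': 1751, '3Bdr': 2183}

over_1200 = []  # apartment types whose average rent exceeds 1200
for apartment, rent in peninsula_south.items():  # each pass yields one key and its value
    print(apartment, rent)
    if rent > 1200:
        over_1200.append(apartment)

print(over_1200)
```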


Similarly to lists, dictionaries can be taken as inputs to functions or can be iterated through using key-value pairs.

In [32]:
def assessRent(region_dictionary, key):
    for i in region_dictionary.keys(): # iterate through each key in the dictionary
        if str(i) == key: # only assess the requested apartment type
            if region_dictionary[i] < 1500:
                print(str(i) + " in this region costs " + str(region_dictionary[i]) + " and is affordable")
            else:
                print(str(i) + " in this region costs " + str(region_dictionary[i]) + " and is unaffordable")


In [33]:
assessRent(peninsula_south, '3Bdr')

3Bdr in this region costs 2183 and is unaffordable


### *Challenge Question 4 (2 points)*
Create a function called `assessRent()` which does the following:
* Takes three inputs: 
    * region_dictionary (dictionary)
    * key (string)
    * threshold (integer)
* Assesses whether the value stored under the given key is less than the threshold
* Returns the value, and whether the apartment was affordable

In [34]:
def assessRent(region_dictionary, key, threshold):
    for i in region_dictionary.keys(): # iterate through each key in the dictionary
        if str(i) == key: # only assess the requested apartment type
            if region_dictionary[i] < threshold:
                print(str(i) + " in this region costs " + str(region_dictionary[i]) + " and is affordable for this person.")
            else: # also covers a rent exactly equal to the threshold
                print(str(i) + " in this region costs " + str(region_dictionary[i]) + " and is unaffordable for this person.")


In [35]:
assessRent(peninsula_south, '1Bdr', 1400)

1Bdr in this region costs 1313 and is affordable for this person.


#### Sample Test 1
Should return `"1Bdr in this region costs 1104 and is unaffordable for this person."`.

In [36]:
peninsula_north = {'Bach': 916, '1Bdr': 1104, '2Bdr': 1333, '3Bdr': 1536} # values extracted from the spreadsheet

assessRent(peninsula_north, '1Bdr', 1100)

1Bdr in this region costs 1104 and is unaffordable for this person.


#### Sample Test 2
Should return `"2Bdr in this region costs 1333 and is affordable for this person."`

In [37]:
peninsula_north = {'Bach': 916, '1Bdr': 1104, '2Bdr': 1333, '3Bdr': 1536} # values extracted from the spreadsheet

assessRent(peninsula_north, '2Bdr', 2000)

2Bdr in this region costs 1333 and is affordable for this person.


# Objective 5: Analyze where rent is growing fastest
Perhaps the most powerful feature of dictionaries is that they can be nested within one another. Python dictionaries are structured similarly to JSON, and are designed to allow users to nest dictionaries within other dictionaries. For example, we could store both the values for `Peninsula South` and `Peninsula North` inside of a larger `halifax` dictionary using the key-value architecture.

In [38]:
# 2018 data

halifax = {
    'Peninsula South' : {'Bach': 888, '1Bdr': 1203, '2Bdr': 1656, '3Bdr': 1949},
    'Peninsula North' : {'Bach': 754, '1Bdr': 951, '2Bdr': 1189, '3Bdr': 1408}
}

This allows us to navigate across many data values. In Python, you can navigate between nested keys by simply writing the nested key adjacent to the key from the first level. The code below will give you the value for the Bachelor apartments in the South End, though you can also use it to retrieve values such as 1 Bedrooms in the North end if you would like. Give it a try!

In [39]:
halifax['Peninsula South']['Bach']

888
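
Nested dictionaries can also be walked with nested loops: the outer loop visits each region and the inner loop visits each apartment type within it. The sketch below re-creates the 2018 `halifax` dictionary so it runs on its own; `cheapest` is a hypothetical helper dictionary introduced only for this example.

```python
halifax = {
    'Peninsula South': {'Bach': 888, '1Bdr': 1203, '2Bdr': 1656, '3Bdr': 1949},
    'Peninsula North': {'Bach': 754, '1Bdr': 951, '2Bdr': 1189, '3Bdr': 1408}
}

cheapest = {}  # hypothetical helper: region -> lowest average rent in that region
for region, rents in halifax.items():         # outer level: regions
    cheapest[region] = min(rents.values())
    for apartment, rent in rents.items():     # inner level: apartment types
        print(region, apartment, rent)

print(cheapest)
```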

If we wanted to, we could further expand our dictionary to encompass years. For example, the following dictionary contains 3 levels: years, locations, and apartment types. 

In [40]:
# corresponds to the 2020 and 2021 data from two CSV files

halifax_rentals = {
    '2020': {
        'Peninsula South': {'Bach': 930, '1Bdr': 1256, '2Bdr': 1756, '3Bdr': 2029},
        'Peninsula North' : {'Bach': 824, '1Bdr': 1036, '2Bdr': 1264, '3Bdr': 1476},
    },
    '2021': {
        'Peninsula South': {'Bach': 965, '1Bdr': 1313, '2Bdr': 1751, '3Bdr': 2183},
        'Peninsula North' : {'Bach': 916, '1Bdr': 1104, '2Bdr': 1333, '3Bdr': 1536},
      
    }
}

We can also navigate through this dictionary using the three levels of keys. The following code gives us the data about 1 bedroom apartment rentals in the North End in 2020.

In [41]:
halifax_rentals['2020']['Peninsula North']['1Bdr']

1036

Similarly, if we wished to compare multiple years, we could simply print the values from multiple dictionary entries. The example below prints the 1 Bedroom apartment values in the North End from 2020 and 2021.

_Note, this is actually one line of code that is broken apart for readability. In Python, an expression inside parentheses can be split across multiple lines_.

In [42]:
print("Rentals on the peninsula north region for 1 Bdr were " 
      + str(halifax_rentals['2020']['Peninsula North']['1Bdr']) 
      + " in 2020 and " + str(halifax_rentals['2021']['Peninsula North']['1Bdr']) + " in 2021.")

Rentals on the peninsula north region for 1 Bdr were 1036 in 2020 and 1104 in 2021.


### *Challenge Question 5 (2 points)*
Create a function called compareRent which compares the growth of rent for a particular region and apartment pairing between two years. The function should do the following:
* Take five inputs:
    * data (dictionary)
    * region (string)
    * apartment (string)
    * year1 (string)
    * year2 (string)
* It should calculate the growth (year 2's value minus year 1's value, divided by year 2's value)
* It should convert this calculated value into a percentage (by multiplying it by 100)
* The percentage should be rounded to one decimal place
* The function should return a string that specifies the apartment type, region, and years

Pro Tip: When this function is done, you can use it to analyze apartments in regions that interest you!

In [43]:
halifax_rentals = {
    
    '2020': {
        'Peninsula South': {'Bach': 930, '1Bdr': 1256, '2Bdr': 1756, '3Bdr': 2029},
        'Peninsula North' : {'Bach': 824, '1Bdr': 1036, '2Bdr': 1264, '3Bdr': 1476},

    },
    '2021': {
        'Peninsula South': {'Bach': 965, '1Bdr': 1313, '2Bdr': 1751, '3Bdr': 2183},
        'Peninsula North' : {'Bach': 916, '1Bdr': 1104, '2Bdr': 1333, '3Bdr': 1536},
      
    }
}



def compareRent(data, region, apartment, year1, year2):
    rent1 = data[year1][region][apartment] # rent in the first year, read from the data input
    rent2 = data[year2][region][apartment] # rent in the second year

    growth = (rent2 - rent1) / rent2 # growth as defined in the instructions above
    pgrowth = round(growth * 100, 1) # convert to a percentage with one decimal place
    print("Rent for " + apartment + " in " + region + " grew by " + str(pgrowth) + "% between " + year1 + " and " + year2)


#### Sample Test 1
Should return `"Rent for 1Bdr in Peninsula South grew by 4.3% between 2020 and 2021"`

In [44]:
compareRent(halifax_rentals, 'Peninsula South', '1Bdr', '2020', '2021')

Rent for 1Bdr in Peninsula South grew by 4.3% between 2020 and 2021


#### Sample Test 2
Should return `"Rent for 3Bdr in Peninsula North grew by 3.9% between 2020 and 2021"`

In [45]:
compareRent(halifax_rentals, 'Peninsula North', '3Bdr', '2020', '2021')

Rent for 3Bdr in Peninsula North grew by 3.9% between 2020 and 2021
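
To answer this objective's question directly, one possible approach is to compute the growth for every region and apartment pairing and then sort the results. The sketch below uses the same growth definition as the challenge question (dividing by year 2's value); `rent_growth`, `growth_table` and `fastest` are hypothetical names introduced for illustration, and the data is re-created so the cell runs on its own.

```python
halifax_rentals = {
    '2020': {
        'Peninsula South': {'Bach': 930, '1Bdr': 1256, '2Bdr': 1756, '3Bdr': 2029},
        'Peninsula North': {'Bach': 824, '1Bdr': 1036, '2Bdr': 1264, '3Bdr': 1476},
    },
    '2021': {
        'Peninsula South': {'Bach': 965, '1Bdr': 1313, '2Bdr': 1751, '3Bdr': 2183},
        'Peninsula North': {'Bach': 916, '1Bdr': 1104, '2Bdr': 1333, '3Bdr': 1536},
    }
}

def rent_growth(data, region, apartment, year1, year2):  # hypothetical helper
    """Percent growth using the lab's definition: (year2 - year1) / year2 * 100."""
    old = data[year1][region][apartment]
    new = data[year2][region][apartment]
    return round((new - old) / new * 100, 1)

growth_table = []  # one (growth, region, apartment) tuple per pairing
for region, rents in halifax_rentals['2021'].items():
    for apartment in rents:
        growth = rent_growth(halifax_rentals, region, apartment, '2020', '2021')
        growth_table.append((growth, region, apartment))

growth_table.sort(reverse=True)  # largest growth first
fastest = growth_table[0]
print(fastest)
```

Because tuples are compared by their first element first, sorting the table orders the pairings by growth without any extra code.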


# References
Canada Mortgage and Housing Corporation (19 January 2020). Housing Market Information Portal. Retrieved from: https://www03.cmhc-schl.gc.ca/hmip-pimh/en#TableMapChart/0580/3/Halifax%20CMA