# What Programming for Data Science is Like
*© 2023 Colin Conrad*

**In this notebook, we will achieve the following objectives:**
- Run `hello world`, a basic Python script
- Assign Python variables
- Perform basic string manipulation
- Turn a csv dataset into a dataframe and build a simple query
- Create an advanced query
- Collect descriptive statistics from your dataframe
- Visualize the price of Airbnb around the University of Toronto

## 1. Run basic Python scripts
Python is a high-level programming language designed to emphasize readability. There are many reasons why someone might want to use Python, though the main reason we have opted to use it in this course is its extensive collection of data science libraries. In fact, you are likely currently using Anaconda, a distribution of Python tools assembled for exactly this use case.

Traditionally, the very first thing that people do when learning a new programming language is learn how to print `hello world!`. In Python, this is really straightforward. Try running the following code by selecting the cell and clicking `Run` (alternatively, you can press `shift + enter`).

In [None]:
# prints 'hello world!'
print('hello world!')

If you got this running and read 'hello world!' in your notebook, congratulations, you may have just run your first Python script! This is a bigger accomplishment than you might think. You have taken your first steps into the wonderful world of code.

Let's break down the concepts above a bit. First, you probably see the `# prints 'hello world!'` bit at the top. This is a comment. In Python, everything on a line after the `#` character is a comment. Comments are cool because they describe the code that you are executing, without themselves being executed. They are an essential part of writing good code because they tell other readers what the code does. It might seem silly at first, but once you start collaborating with other people you *will* understand the value of comments.

The second thing worth discussing is the `print('hello world!')` bit. `print()` is one of Python's built-in functions. It displays the value contained within the parentheses, in this case a string, as output below the cell (or in the console when the code runs outside a notebook). This is super handy when trying to debug code, and serves as a great starting point for our tutorial.

Another major difference between this exercise and past exercises is that we are using a notebook. Notebooks are increasingly popular among researchers and data scientists. In fact, I feel comfortable saying that they are _an essential tool_ for doing data science work. Without tools like Jupyter, we will create a mess; replicable science needs *documentation*. Using Jupyter, we are able to effectively share code, but perhaps more importantly, to describe it. In many ways, the art of documentation is a subject in its own right, which we will touch on throughout this course.

Jupyter notebooks are divided into a series of cells, each of which contains code or text that can be written and rearranged however you would like. You have already executed a cell above, when you ran your first Python code.

By default, Jupyter notebooks support creating documentation using a type of code called *Markdown*. Unlike Python, which is designed to process logic, Markdown is designed to facilitate the creation of text. In other words, Markdown is not a programming language; rather it is a *markup language*. More specifically, Markdown is a lightweight markup language designed to be converted into the hypertext markup language (a.k.a. HTML), and it borrows many of HTML's concepts, such as headings, lists and links. If you have ever taken an introductory web design course, Markdown will feel familiar to you; you can even include HTML code directly in Markdown text!

We will not get into the details of how to create Markdown files today, though we will slowly build this skill up as we make our way through the course. For now, the most important things that you need to know are that:

1. You can specify the type of code that each cell belongs to using the dropdown box immediately to the right of the `Run` button.
2. You can modify Markdown cells much like Python cells.

## 2. Assign Python variables
Like an Excel spreadsheet, programming languages can store data, and they do so in the form of variables. In Python, variables can hold many different types of data: integers, strings, floats, and more complex data structures such as lists and dictionaries. Using data stored in variables, we can create code that does virtually anything that we can imagine!

Building on the main theme of this book, let's learn by doing. Last year, there were 122 people registered for my undergraduate data management class. If we wanted to save this number in a variable, we could write the following Python code.

In [None]:
enrollment = 122

Python has now saved a variable called `enrollment` and knows that its value is the integer `122`. Try running the code below to see if Python remembers the number of people enrolled.

In [None]:
enrollment

This is good. However, this example is only somewhat accurate. Many students in my classes do not join us in person, and for various reasons can only join us online. If we wanted to store this data properly, we might want to create three variables: one containing the in-person enrollment, one containing the online enrollment, and one containing the total enrollment. Let's try expressing them using the code below.

In [None]:
in_person_enrollment = 88 # the number who took the course in-person

online_enrollment  = 34 # the number who attended remotely

# the total number attending the course
total_enrollment = in_person_enrollment + online_enrollment 

Finally, if we want to retrieve any of these variables, we can do so at any time in Jupyter. For example:

In [None]:
in_person_enrollment

In [None]:
online_enrollment

In [None]:
total_enrollment
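
The variables above all hold integers. As a minimal sketch of the other types mentioned at the start of this section, the cell below stores one value of each type and asks Python to report it with the built-in `type()` function. The names and values here (`average_grade`, `section_sizes`, `course_info`, and so on) are hypothetical and chosen only for illustration.

```python
# Hypothetical example values, chosen only to illustrate each type
example_enrollment = 122                          # int: a whole number
average_grade = 78.5                              # float: a number with a decimal part
course_name = "Working with Data"                 # str: a series of characters
section_sizes = [88, 34]                          # list: an ordered collection of values
course_info = {"year": 2022, "enrollment": 122}   # dict: values looked up by a key

# type() reports the type of the value stored in a variable
print(type(example_enrollment))
print(type(average_grade))
print(type(course_name))
print(type(section_sizes))
print(type(course_info))

# Values inside a list or dictionary can be retrieved individually
print(section_sizes[0])      # the first item in the list
print(course_info["year"])   # the value stored under the key "year"
```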

## 3. Perform basic string manipulation
One of the most interesting features of Python is the way that strings are stored. As mentioned earlier, strings are one of the types of data that can be stored in a variable; they consist of a series of characters. Python has a number of built-in features that make it easy to change strings to suit our needs. For instance, Python treats strings as a sequence of elements (similar to what other languages call an 'array'), which makes it easy to break them apart. If we wanted to retrieve the first character in the `course_title` string, we could do so by using the following code.

In [None]:
course_title = "2022 Working with Data"

course_title[0] # retrieves the first character; positions are counted from 0

This is a very handy feature for data scientists. You will often have to manage textual data, and having the ability to easily separate strings will save you not only time, but also a great deal of grief. Let's use a more concrete example. If we wanted to retrieve only the course year in the `course_title` string, we could create a variable called `course_year` which consists of only the first four characters.

In [None]:
#retrieves the first four characters in the course_title string
course_year = course_title[0:4]

course_year

Python also comes with a number of built-in functions that work with strings. We have already seen `print()`, which displays the values contained in the function's parentheses. Another handy function is `len()`, which gives us the length of a string.

In [None]:
len(course_year) #specifies the number of characters in the string
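
The slice notation used above follows a `start:stop` pattern in which the character at `stop` is not included. The cell below is a small sketch of a few variations on the same `course_title` string; the extra variable names (`same_year`, `course_name`, `last_word`) are only illustrative.

```python
course_title = "2022 Working with Data"

first_character = course_title[0]   # position 0 is the first character
course_year = course_title[0:4]     # positions 0, 1, 2 and 3; position 4 is excluded
same_year = course_title[:4]        # leaving out the start means "from the beginning"
course_name = course_title[5:]      # leaving out the stop means "to the end"
last_word = course_title[-4:]       # negative positions count back from the end

print(first_character)
print(course_year)
print(course_name)
print(last_word)
print(len(course_title))            # counts every character, including spaces
```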

Another handy feature of Python is that strings can be easily concatenated, or joined end to end. In Python, we can concatenate strings by simply using the `+` operator. If we wanted to add the name of the course to the course title, we can easily do that.

In [None]:
#adds the name to the course title
course_title = course_title + ": Data Management for Business and Social Science students (of all scholarly stripes)"

print(course_title) #prints the new course title
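
One thing to watch for is that `+` only joins strings to other strings. The sketch below, which uses hypothetical variable names (`example_title`, `example_enrollment`, `summary`), converts an integer with the built-in `str()` function before concatenating it. Trying `example_title + 122` directly would raise a `TypeError`.

```python
example_title = "2022 Working with Data"
example_enrollment = 122

# str() turns the integer into the string "122" so that it can be joined
summary = example_title + " has " + str(example_enrollment) + " students"

print(summary)
```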

## 4. Turn your dataset into a dataframe and build a simple query

With this, you probably understand the basics. Let's do something a little more fun. One of the great things about Python is that it gives us access to thousands of _libraries_ that are maintained by open source contributors. Libraries make our lives easier because they provide additional functionality that would take years to develop on our own. Throughout the rest of this exercise, we will use `pandas`, one of the most common Python data frame libraries, which is very often used by data scientists.

The [Pandas](https://pandas.pydata.org/docs/getting_started/index.html#getting-started) dataframe is a data structure, sort of like a spreadsheet but much more complex, which makes it easier to navigate and analyze large datasets. Built upon numpy and other dependencies, this tool is among the most essential resources for conducting analysis on larger datasets. Like basic Python, we will use this tool in nearly all subsequent exercises, so be sure to watch this one closely.

Helpful reading: [10 minutes to pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)

It's also important to note that `pandas` is built on top of the `numpy` library, which was designed for fast numerical computing. NumPy stores your data in multi-dimensional arrays, sort of like a hyper-efficient spreadsheet without row or column labels. It's not especially convenient to use in its raw form unless you are interested in going deep into machine learning. Pandas organizes our data into labelled tables (a.k.a. data frames), which are easier to calculate with and sort through.
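
To make the difference concrete, here is a minimal sketch that stores the same numbers first as a NumPy array and then as a pandas dataframe. The three rows of prices and review counts are made up for illustration and are not taken from the Airbnb data.

```python
import numpy as np
import pandas as pd

# Hypothetical prices and review counts for three listings
raw = np.array([[120.0, 14.0],
                [85.0, 3.0],
                [240.0, 51.0]])

# NumPy: values are addressed by position only (row 0, column 1)
print(raw[0, 1])

# pandas: the same values, but each column now has a name
listings = pd.DataFrame(raw, columns=["price", "number_of_reviews"])
print(listings["price"].mean())
```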

Let's return to the Airbnb data originally explored in Chapter 6, only this time, we will explore Toronto, a much bigger city. To transform a csv file into a pandas object we need to import the pandas library. We can then import a csv file by using pandas' built-in `read_csv()` function.

In [None]:
import pandas as pd # import pandas 

# import numpy; it's usually a good practice to import this as well
import numpy as np

import matplotlib.pyplot as plt # we will use this to visualize later

# command pandas to import the data; isn't this easier than the csv library?
tor = pd.read_csv('data/9_toronto_listings.csv')
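
If you do not have the data file handy, the sketch below shows the same `read_csv()` call on a tiny CSV held in memory. The three rows and the names `csv_text` and `toy` are hypothetical; the column names only mimic the ones used later in this notebook.

```python
import io
import pandas as pd

# A hypothetical three-row CSV held in memory instead of in a file
csv_text = """id,neighbourhood,room_type,price
1,University,Private room,85
2,Annex,Entire home/apt,240
3,University,Entire home/apt,180
"""

# io.StringIO lets read_csv treat the text as if it were a file
toy = pd.read_csv(io.StringIO(csv_text))

print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # the data type pandas inferred for each column
```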

### Dataframe head
Once our data frame has been imported we can apply a few methods that can generate knowledge about the dataset. The `head()` method shows us the first five rows in the dataset. If we specify `head(2)` then we will return just the first two rows.

In [None]:
tor.head(2)

### Dataframe series

Data frames are easily navigable compared to lists or dictionaries. If we want to retrieve all of the data from a column in the dataframe, we can access that column by name as an attribute of the dataframe, using the same dot notation used for calling a method. The code below will give us the values for `neighbourhood` from the whole dataset, though Jupyter will display only the first and last few values.

In [None]:
tor.neighbourhood
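
Dot notation is a shortcut for the square bracket notation. As a small sketch on a made-up dataframe (the `toy` data and the `room type` column name with a space are hypothetical), the cell below shows that both notations return the same column, and that brackets are needed when a column name is not a valid Python name.

```python
import pandas as pd

toy = pd.DataFrame({
    "neighbourhood": ["University", "Annex", "University"],
    "room type": ["Private room", "Entire home/apt", "Entire home/apt"],
})

by_attribute = toy.neighbourhood      # dot notation
by_bracket = toy["neighbourhood"]     # square bracket notation

# Both notations return the same column (a pandas Series)
print(by_attribute.equals(by_bracket))

# A column name containing a space can only be reached with brackets
room_types = toy["room type"]
print(type(room_types).__name__)
```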

### Sort values
In addition, dataframes can be easily sorted. These sorting features are similar to SQL (_Structured Query Language_) which many of you will be familiar with. The following code will sort the data by price starting with the highest values. 

I wonder who seriously believes that they can rent an apartment for $13 437 per night?

In [None]:
tor.sort_values(by='price', ascending=False).head(2)

## 5. Create an advanced query
### Subsetting the data
Dataframes are for a lot more than viewing large sets of observations. Perhaps the coolest feature of a dataframe is that it facilitates efficient queries that retrieve subsets of the data. In pandas, a subset is selected by writing square brackets following the data frame; for instance, `tor['neighbourhood']` would return the values of the `neighbourhood` column. However, we can also use square brackets to conduct Boolean searches. For instance, if we wanted to retrieve only the rows where `neighbourhood == 'University'` we could write a query as follows.

In [None]:
university = tor[tor.neighbourhood == 'University']
university.head(2)
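
It can help to see what the expression inside the square brackets produces on its own. In the sketch below, which uses a made-up three-row dataframe called `toy`, the comparison yields one `True` or `False` value per row, and the dataframe keeps only the rows marked `True`.

```python
import pandas as pd

toy = pd.DataFrame({
    "neighbourhood": ["University", "Annex", "University"],
    "price": [85, 240, 180],
})

# The comparison is evaluated row by row and yields a column of True/False values
mask = toy["neighbourhood"] == "University"
print(mask.tolist())

# Passing that column inside square brackets keeps only the True rows
university_toy = toy[mask]
print(len(university_toy))
```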

### Query using two conditions

Queries can also be more complex. If we wish to choose a subset of data which is constrained by two conditions, we can include both conditions by using the `&` operator. The following query will retrieve the values that match `University` which also have a `last_review` equal to `2023-06-04`, the date closest to when I retrieved this data.

In [None]:
recent_university = tor[(tor.neighbourhood == 'University') & 
                      (tor.last_review == '2023-06-04')]

recent_university.sort_values(by='price', ascending=False).head(2)
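
Two details of this syntax are easy to miss: each condition needs its own parentheses, and pandas uses the symbols `&`, `|` and `~` rather than the words `and`, `or` and `not`. The cell below sketches all three on a hypothetical four-row dataframe; the data and variable names are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "neighbourhood": ["University", "Annex", "University", "Annex"],
    "price": [85, 240, 180, 60],
})

# & keeps the rows where both conditions are true
both = toy[(toy["neighbourhood"] == "University") & (toy["price"] < 100)]

# | keeps the rows where at least one condition is true
either = toy[(toy["neighbourhood"] == "University") | (toy["price"] < 100)]

# ~ flips a condition, keeping the rows where it is false
not_university = toy[~(toy["neighbourhood"] == "University")]

print(len(both), len(either), len(not_university))
```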

## 6. Collect descriptive statistics from your dataframe
One of the most handy features of pandas dataframes is that they come with a few built-in methods for conducting descriptive analysis. For example, the `.describe()` method will give a summary of statistical measures of a given dataframe. Let's choose to apply `.describe()` to a single column, in an effort to give us a manageable amount of information.

In [None]:
tor.price.describe()

### Calculate the mean price, sum, and number of unique values
In addition, dataframes have methods for calculating specific statistics such as the mean, median and mode.

Alternatively, if we wanted to find the sum of a column (e.g. the total number of reviews) we can use the `.sum()` method.

Finally, there are a few other methods which are handy. For instance, the `.nunique()` method will tell us the number of unique values in a column. These methods are demonstrated below.

In [None]:
tor.price.mean()

In [None]:
tor.number_of_reviews.sum()

In [None]:
tor.host_id.nunique()
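
The median and mode work the same way as the mean. The sketch below uses five made-up prices, one of which is unrealistically high, to show why it is worth checking more than one statistic: a single extreme listing pulls the mean far above the median.

```python
import pandas as pd

# Hypothetical nightly prices; the last value is an extreme outlier
prices = pd.Series([60, 85, 85, 180, 2000], name="price")

print(prices.mean())            # the average, pulled upward by the outlier
print(prices.median())          # the middle value, unaffected by the outlier
print(prices.mode().tolist())   # the most frequent value(s)
print(prices.nunique())         # the number of distinct values
```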

## 7. Visually compare apartment types
Finally, we are now ready to visualize the data. Pandas dataframes have built-in functions for conducting visualizations that leverage a popular data visualization library called `matplotlib`. We will explore this library in more detail in other exercises, but I wanted to conclude this chapter's exercise by giving you a taste of what it can do. In this case, we may wish to find which room types have good deals. Let's return to the `university` segment.

To create an effective visualization, we may wish to simplify our data, such as by calculating the median price of each room type. Fortunately, pandas has the `groupby()` method, which splits the rows into groups that share the same value in a column. We can use this to calculate the median values and visualize the results.

In [None]:
university_rooms = university.groupby("room_type") # group by room_type

# calculates medians, plots the graph
university_rooms['price'].median().plot.bar(figsize=(6,3)) 
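
Before plotting, it can be useful to look at the numbers that the chart is built from. The sketch below repeats the same `groupby` and `median` steps on a hypothetical five-row dataframe and prints the result instead of plotting it. Calling `.plot.bar()` on `median_by_room` would draw one bar per room type.

```python
import pandas as pd

toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Entire home/apt",
                  "Private room", "Entire home/apt"],
    "price": [85, 240, 180, 60, 200],
})

# Split the rows by room_type, then take the median price within each group
median_by_room = toy.groupby("room_type")["price"].median()

print(median_by_room)
```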

## Conclusion and credit exercises
Programming is trickier than using pre-built software, but it gives us a lot of flexibility. Though getting to the point where we create simple visualizations seems underwhelming, especially when compared to our previous exercises with Tableau, this exercise demonstrates that it is doable. If anything, I hope you can appreciate how data scientists spend their days.

In the subsequent exercises we will build on these skills to demonstrate two core functions of data science: statistical analysis and machine learning. These exercises will also leverage the Jupyter notebook format to demonstrate the skills.

If you are completing this exercise as part of a course, please see your learning management system for the graded exercise questions.

## References
10 minutes to Pandas. https://pandas.pydata.org/docs/user_guide/10min.html

Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90-95.

Inside Airbnb. http://insideairbnb.com/get-the-data/