# Tutorial 3: Working with data types in Python

There are 11 tasks in this notebook (be warned: some tasks have more than one part), split across 3 sections that you can choose between:

1. String data 
2. Categorical data (and Boolean and Numeric and Missing) which includes some advanced bonus tasks
3. Date and time data (and string)

The aim of this tutorial notebook is to give you some (guided) hands-on experience working with different data types in Python, which you can then compare with the approaches to working with these data types in R.

If you are pair programming, switch who is the driver and who is the navigator at the start of each section. I have included switching prompts within sections before certain tasks as well. If you complete a section here and move to the R notebook, switch drivers then as well. 

In [None]:
# it is always good practice to load the necessary packages and modules at the start of your document
import pandas as pd 
from pandas.api.types import CategoricalDtype 
import numpy as np
import datetime as dt 
import re
import itertools 
from dateutil import parser, tz, relativedelta

## there is a FutureWarning that looks scary but does not matter to us at the moment, so this code suppresses it
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## 1. String data 

### Task 1 

How would you access "solstice" in `string0` below in code?

In [None]:
string0 = "The summer solstice is on Thursday 20 June 2024"

<details><summary style='color:darkblue'>HINT 1: How to start breaking it down? CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

We learned about a function in the Python data types notebook that helps us find the index at which a substring starts within a string (see section 4). From there, we can use that information to access, or *slice*, the string.
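
As a minimal sketch of that idea on a different string (the string and variable names are made up for illustration, and `str.find()` is only one of the functions that can do this):

```python
# Hypothetical example string, not the one from the task
sentence = "Data types are fun"

# str.find() returns the index where the substring starts (or -1 if absent)
start = sentence.find("types")

# slice from the start index up to the start index plus the substring length
end = start + len("types")
print(start, end)
print(sentence[start:end])
```

Computing `end` from `len()` avoids counting characters by hand.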

In [None]:
## your answer here



### <font color='green'>*If you are pair programming, switch who is driving now*</font>
Remember to run the code cells loading packages & modules when the new driver takes over

### Tasks 2 - 6

There is a corpus of common words in the R `stringr` package that we will use as our data for this task. 

The process of importing this data and making it workable for this task is a bit complicated. I have outlined the logical steps below in code and here in plain language.

First we need to read in the data. To do so, we use the `pd.read_csv('file.csv')` function from the `pandas` package. `read_csv` automatically reads data from a csv file into a `pandas` data frame. Usually this is what we want (as we will see next week), but in this case the data is really just a list of words, so we will convert the data frame to a list. But oh no, it is a list within a list! We then need to flatten the list structure and join its elements, separated by a space, to create a string that we can work with. You could leave the data as a list within a list, or indeed as a flat list, but for the purposes of this week we are learning how to interact with strings.


In [None]:
# read in data
## my data is in a folder called data. If you do not have the same set up, update the file path accordingly 
word_data = pd.read_csv('../data/common_words.csv', header = None) # the first row is not a header, so I have specified header = None 

type(word_data) # indeed word_data is currently a data frame 

In [None]:
# view the data that has been read in as a pandas data frame 
print(word_data)

In [None]:
# now convert words to a list for this task 
word_list = word_data.values.tolist()

type(word_list) 

When reading in data, it is always good practice to print it to make sure it parsed as expected. For this we can use `print()`.

In [None]:
# to see the list within a list structure if you are interested
print(word_list)

# notice we now have lists within a list [[...]]

In [None]:
# flatten the list structure; this uses the chain function from the itertools module
word_list_flat = list(itertools.chain(*word_list)) 

print(word_list_flat)
# great, we are getting there! 

In [None]:
# join the lists to be a string separated by a space 
words = " ".join(word_list_flat)

In [None]:
print(words) # happy days - we are ready to go 

Now we are ready for the task! How many words: 

- Task 2: Start with "y"
- Task 3: End with "w"
- Task 4: Are exactly 3 letters long
- Task 5: Have 8 letters or more
- Task 6: Contain only consonants
 
Tasks 4, 5, and 6 are a bit more tricky. 

To really stretch yourself, consider using code to produce the answer to the above questions once you have a solution. 

<details><summary style='color:darkblue'>HINT 1: How to start breaking it down? CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>
    
To address these problems you will need to use regular expressions. There is a helpful Python regular expression [cheat sheet here](https://www.dataquest.io/wp-content/uploads/2019/03/python-regular-expressions-cheat-sheet.pdf)

<details><summary style='color:darkblue'>HINT 2: Useful functions. CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

You do not need to manually count the string outputs, remember the `len()` function

<details><summary style='color:darkblue'>HINT 3: How to use code to produce the answer? CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

We learned how to do string interpolation this week in the Python data types notebook (see section 4)
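
To illustrate how the three hints fit together, here is a small sketch on a made-up sentence rather than on `words`. The text, pattern, and variable names are illustrative assumptions, and the tasks above need different patterns.

```python
import re

# Hypothetical toy text, not the corpus used in the tasks
text = "sing a song while walking and talking"

# \b marks a word boundary and \w+ matches one or more word characters,
# so this pattern finds whole words that end in "ing"
ing_words = re.findall(r"\b\w+ing\b", text)
print(ing_words)

# len() counts the matches and an f-string interpolates the result
print(f"There are {len(ing_words)} words ending in 'ing'.")
```

Without the `\b` anchors, the pattern could also match part of a longer word, so the count would be unreliable.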

In [None]:
## your answer here



## 2. Categorical data (and Boolean and Numeric and Missing) 

For this section, there is yet again a data set available in R that we will be using. This time the data comes from the `forcats` package. `forcats::gss_cat` is a sample of data from the General Social Survey, which is a long-running US survey conducted by the independent research organization NORC at the University of Chicago. As the survey has thousands of questions, the `gss_cat` data contains a small subset. 

Since this data set is provided by an R package, you can get more information about the variables **in R** with `?gss_cat`. 

As above, to read in the data we will be using our new friend, the `pd.read_csv('file.csv')` function from the `pandas` package.


In [None]:
# read in data
## my data is in a folder called data. If you do not have the same set up, update the file path accordingly 
gss_cat = pd.read_csv('../data/gss_cat.csv')

gss_cat

In [None]:
## we will learn more about data frames next week but one function to get a summary of what a data frame contains is data.info()
## data.info() is similar to glimpse() in R 

gss_cat.info()


### Task 7
What data types are the different variables in `gss_cat`? If there is categorical data, should it be nominal or ordinal? What are the categories of the categorical data? Should or could any of the variables be represented differently in terms of data type?

Go through the data set one column at a time answering the above questions for each of the 8 columns. 

<details><summary style='color:darkblue'>HINT 1: Some new functions! CLICK HERE TO SEE THE ANSWER.</summary>

This task asks you to do something we are familiar with from last week, thinking about data types, but to do so you need some code for working with data frames that we have not learned yet. To select a single column, use square brackets `[]` with the name of the column of interest as a character string, e.g., `dataframe["column"]`. We will discuss this more next week; for those interested, the returned object is a `pandas Series` (which we used a little in the Python Data Types Notebook). 
    
Then you can use the `data.head()` function to see the first few rows of data as well as the data type.
    
You can also use `data.describe()` to get a summary of the variable. To get higher-level information, you can use `data.info()`.
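
As a rough sketch of how these functions behave, here they are on a tiny made-up data frame (the `pets` data and its column names are hypothetical, not part of `gss_cat`):

```python
import pandas as pd

# Hypothetical toy data frame, not gss_cat
pets = pd.DataFrame({
    "species": ["cat", "dog", "cat", "fish"],
    "weight_kg": [4.2, 12.5, 3.8, 0.1],
})

# square brackets with a column name return a pandas Series
weights = pets["weight_kg"]
print(type(weights))

# the first rows, plus the dtype of the column
print(weights.head(2))

# summary statistics for a numeric column
print(weights.describe())

# column names, non-null counts, and dtypes for the whole data frame
pets.info()
```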

<details><summary style='color:darkblue'>HINT 2: A new data type (gasp there's more!). CLICK HERE TO SEE THE ANSWER.</summary>

You will notice that some variables/columns are of type `object`... but what does this mean?? In `pandas`, the `object` dtype represents text or mixed numeric and non-numeric values. Text columns read from a csv file are stored as `object` and their values behave like strings, so an operation such as `+` joins text rather than adding numbers. Thus, if you want a variable to be treated as categorical (`dtype category`) you need to explicitly cast it as such. The simplest way to convert a column to a categorical type is to use `astype('category')`.
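
A minimal sketch of the cast on a made-up Series follows (the `spice` data is hypothetical). It also shows one way to declare an ordinal variable with `CategoricalDtype`, which is imported at the top of this notebook; treat it as an illustration rather than the required approach.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical toy Series, not a column from gss_cat
spice = pd.Series(["mild", "hot", "medium", "mild"])

# simplest cast: an unordered (nominal) categorical
spice_nominal = spice.astype("category")
print(spice_nominal.dtype)
print(spice_nominal.cat.categories)

# for ordinal data, spell out the categories and their order explicitly
spice_levels = CategoricalDtype(categories=["mild", "medium", "hot"], ordered=True)
spice_ordinal = spice.astype(spice_levels)
print(spice_ordinal.cat.ordered)
print(spice_ordinal.max())
```

With the plain `astype("category")` cast the categories are sorted alphabetically and carry no order, which is why the ordinal version lists them explicitly.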

In [None]:
## your answer here



### <font color='green'>*If you are pair programming, switch who is driving now*</font>
Remember to run the code cells loading packages & modules and reading in the data for this section when the new driver takes over

### Task 8 

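Make `age` a factor (i.e., a categorical data type in `pandas`). When modifying an object by changing values or the data type, it is good practice to create a new object with a meaningfully modified name rather than overwrite the original one. The code further down in this section assumes the new object is called `age_f`. 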

In [None]:
# first let's take out age as a pandas Series 
age = gss_cat["age"]

In [None]:
type(age)

In [None]:
## your answer here



### Task 8 Advanced 

From here, it is a bit more advanced to make `age` an ordinal categorical data type with 5 levels: 18-25, 26-44, 45-64, 65-74, 75+. To do so, we need to use `pandas.cut()` to sort the data values into bins. 

**Before looking at the solution, challenge yourself to think about the logical steps needed to solve this problem. Write them down and see how they match up to the solution provided.**

##### Your answer to the logical steps 

1. ....


2. ....


3. ....


4. ....


....

As we have not learned about this function yet, I will show you the solution and ask you to try to figure out how it works. Modify some of the code to see what happens. Do not worry, you cannot break your computer (unless you throw it perhaps)! If you have any errors you cannot figure out, ask one of the teaching team for help during the tutorial or post on the discussion boards afterwards. 

Before making any changes to our variable, it is good practice to check if there are any missing values lurking in the shadows trying to ruin our day. 

In [None]:
age_f.isna().values.any()

## try running the above without .values.any() to see why we need it 

In [None]:
# next we can use sum to see how many there are 
## because isna() returns Boolean values and True counts as 1, we can use sum... understanding data types is so useful! 

age_f.isna().sum()

So we do indeed have some missing values (76 to be exact), which we will keep in mind. 

In [None]:
age_groups = pd.cut(
    age_f,
    bins = [-np.inf, 25, 44, 64, 74, np.inf],
    labels = ["18-25", "26-44", "45-64", "65-74", "75+"]) 

In [None]:
age_groups.head() # looking good 

In [None]:
age_groups.cat.categories # celebration!

In [None]:
# and look at that, pandas.cut() made it ordered for us too! 
## this is because the default value of the ordered argument is True
print(age_groups.cat.ordered) 
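
One detail of `pd.cut()` worth checking is which bin a value falls into when it sits exactly on a bin edge. The sketch below uses two made-up ages and only the first two bins to compare the default `right=True` with `right=False`; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical ages sitting right on and just past a bin edge
edge_ages = pd.Series([25, 26])
edges = [-np.inf, 25, 44]
labels = ["18-25", "26-44"]

# default right=True: each bin includes its right edge, so 25 is "18-25"
right_closed = pd.cut(edge_ages, bins=edges, labels=labels)
print(list(right_closed))

# right=False: each bin includes its left edge instead, so 25 moves up a bin
left_closed = pd.cut(edge_ages, bins=edges, labels=labels, right=False)
print(list(left_closed))
```

The labels only describe the bins correctly when the edges and the `right` setting agree with them, so it is worth testing a boundary value like this.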

Reading documentation is a skill that you will develop over time with practice. Try to read the documentation for the `cut` function from pandas to see what you can learn. Ask a member of the teaching team during the tutorial or post on the discussion boards if you get stuck.

In [None]:
# look at the documentation for more info 
help(pd.cut)

### <font color='green'>*If you are pair programming, switch who is driving now*</font>
Remember to run the code cells loading packages & modules and reading in the data for this section when the new driver takes over

### Task 9 

How could you collapse `rincome` into a small set of categories (e.g., `"Unknown"`, `"less than $5000"`, `"$5000 to $9999"`, `"$10000 or more"`)?

Look at some summaries of the object and think about some of the challenges that you need to overcome to complete the task. Write down the steps you would need to take in plain language, regardless of whether you know how to do it in code. Understanding **what** you need to do is just as important as, if not more important than, **how** you will do it (i.e., in code).

You can also look back at your solution to Task 7 above. 


In [None]:
# first let's take out rincome as a pandas Series 
rincome = gss_cat["rincome"]

print(rincome.describe())

In [None]:
print(rincome.head())

In [None]:
print(rincome.unique())

##### Your answer to the logical steps 

1. ....


2. ....


3. ....


4. ....


....


### Advanced bonus task (Task 9) 

The advanced bonus task, should you choose to accept it, is to attempt your solution to Task 9 in code! See how far you can get! Have a look at the solutions document for a worked solution to this task. 
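
If you are unsure where to begin, here is one possible pattern for collapsing categories, sketched on made-up clothing sizes rather than on `rincome`. The data, the mapping, and the names are hypothetical, and this is not necessarily the approach taken in the solutions document.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical toy data, not rincome
sizes = pd.Series(["XS", "S", "M", "L", "XL", None])

# dictionary mapping each original value to its collapsed group
size_groups = {"XS": "small", "S": "small", "M": "medium", "L": "large", "XL": "large"}

# values missing from the dictionary (here the missing value) become NaN,
# which we then label explicitly
collapsed = sizes.map(size_groups).fillna("Unknown")

# cast to an ordered categorical so the groups have a meaningful order
group_levels = CategoricalDtype(["Unknown", "small", "medium", "large"], ordered=True)
collapsed = collapsed.astype(group_levels)
print(collapsed.value_counts(sort=False))
```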

In [None]:
## your answer here



## 3. Date and time data (and string)

### Task 10 

Create an object showing the date 140 days from now and print the output nicely formatted (`"month day, year at hour minute"`) using `strftime()`. Then create an object with the date 2 years from now and similarly print the output nicely formatted. 

<details><summary style='color:darkblue'>HINT 1: How to start breaking it down? CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>
    
`timedelta` instances allow for arithmetic, but only in fixed-length units such as weeks, days, hours, minutes, or seconds. To add or subtract calendar intervals whose length varies, such as a month or a year, we use `relativedelta`.
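
To make the difference concrete, here is a small sketch that starts from a fixed, made-up date rather than from now (the dates, names, and format string are illustrative assumptions):

```python
import datetime as dt
from dateutil import relativedelta

# Hypothetical fixed starting point, so the result is the same on every run
start = dt.datetime(2024, 1, 31, 9, 30)

# timedelta: a fixed-length interval
ten_days_later = start + dt.timedelta(days=10)

# relativedelta: a calendar interval; the day is clipped to the end of February
one_month_later = start + relativedelta.relativedelta(months=1)

# strftime() turns a datetime into a formatted string
print(ten_days_later.strftime("%d/%m/%Y %H:%M"))
print(one_month_later.strftime("%d/%m/%Y %H:%M"))
```

Because the notebook imports the `relativedelta` module itself, the class is called as `relativedelta.relativedelta(...)`.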

In [None]:
## your answer here



### <font color='green'>*If you are pair programming, switch who is driving now*</font>
Remember to run the code cells loading packages & modules when the new driver takes over 

## Task 11 

This is a big one, so I have separated the task into different parts. By the end, you will have made a countdown clock to your birthday! (how cool!)

We will start by making a countdown clock until the annual Fringe Festival in Edinburgh in August. The festival starts on 2 August 2024 at 13:35.

### Step 1

Create a datetime object with the Fringe date.

In [None]:
## your answer here



### Step 2

Create a countdown date object using arithmetic from the fringe datetime until now. This will be a `timedelta` data type, which represents the time between 2 `datetime` instances.

<details><summary style='color:darkblue'>HINT: Useful functions. CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

Use one of the `dt.datetime` functions to get the date and time now, rather than hard coding it.
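
As a sketch of what datetime arithmetic returns, the example below subtracts two fixed, made-up datetimes (the task itself should use the current time, and the names here are hypothetical):

```python
import datetime as dt

# Hypothetical fixed datetimes (the task uses the current time instead)
event = dt.datetime(2030, 1, 1, 0, 0)
reference = dt.datetime(2029, 12, 25, 18, 30)

# subtracting two datetime objects gives a timedelta
gap = event - reference
print(type(gap))

# .days holds whole days; .seconds holds the leftover part of a day
leftover_hours = gap.seconds // 3600
print(f"{gap.days} days and {leftover_hours} hours to go")
```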

In [None]:
## your answer here



### Step 3

Write an interpolating character string which will take our countdown object and tell us how many days until Fringe! 


In [None]:
## your answer here



### Step 4

We have a minimum viable product (MVP) for our task, which is great! *BUT* we can improve our countdown accuracy using timezones (i.e., aware objects)! Let's say we want a countdown specifically for someone living in California in the United States.  

Create a second datetime object for the Fringe date and set the correct time zone (i.e., Edinburgh).

In [None]:
## your answer here



### Step 5 

Now create a datetime object for the current time in California.

<details><summary style='color:darkblue'>HINT: Useful functions. CLICK HERE TO SEE THE ANSWER. BUT REALLY TRY TO DO IT YOURSELF FIRST!</summary>

* Use one of the timezone `tz` functions we learned about to create time zones not reported by your system
* Use one of the `dt.datetime` functions to get the date and time now, rather than hard coding it 
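
A minimal sketch of aware datetime objects, using two zones that are deliberately different from the ones in this task. It assumes `tz.gettz()` with IANA zone names is the function meant in the hint; the zone choices and variable names are illustrative.

```python
import datetime as dt
from dateutil import tz

# Hypothetical zones, chosen to differ from the ones in the task
paris = tz.gettz("Europe/Paris")
tokyo = tz.gettz("Asia/Tokyo")

# the same wall-clock time in two zones is two different moments
noon_paris = dt.datetime(2024, 7, 1, 12, 0, tzinfo=paris)
noon_tokyo = dt.datetime(2024, 7, 1, 12, 0, tzinfo=tokyo)
print(noon_paris - noon_tokyo)

# the current time as an aware datetime in a chosen zone
now_tokyo = dt.datetime.now(tz=tokyo)
print(now_tokyo.tzinfo is not None)
```

Subtracting two aware datetimes accounts for the offset between their zones, which is what makes the aware countdown more accurate than the naive one.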

In [None]:
## your answer here



### Step 6 

Now we are ready again to create a second countdown date object using arithmetic with the aware datetime objects we have created for the Fringe and for now (in California, USA).

In [None]:
## your answer here



### Step 7 

Final step, write an interpolating character string which will take our countdown object and tell us how many days until Fringe! 


In [None]:
## your answer here



### Bonus 

When creating date, time, or datetime objects, you can use the `parser.parse()` function from `dateutil` which takes a string and parses (reads) the date into Python for you!

In [None]:
## for example 

example_date = parser.parse("1 January 2024 1:00AM")

print(example_date)

### <font color='green'>*If you are pair programming, switch who is driving now*</font>
Remember to run the code cells loading packages & modules when the new driver takes over 

### Step 8 

Put it all together and, instead of Fringe, use your next birthday! If you want to use aware datetime objects, guess which timezone you may be in on your birthday. Be sure to update the interpolating string to reflect the new countdown event.

In [None]:
## your answer here



---

## Well done! 🎉 

Well done! You have completed all of the tasks for the Python notebook for this tutorial. If you have not done so yet, now move to the R notebook.

---
*Dr. Brittany Blankinship (2024)*