# Lab 1

## Topics covered:
1. Intro to packages 
2. Data Science Life Cycle
3. Data Structures in Python
4. Pandas
5. Visualizations and plotting with Dataframes
6. Intro to APIs
7. Intro to NLP tasks
8. Regular Expressions
9. Concordances and Collocations


# Intro to Python Packages

In [None]:
# load the packages we will use for the remainder of the lab.
import pandas as pd 
import numpy as np 
import regex as re 

# We are using shorthands for package names. This allows us to call these packages as "pd" instead of "pandas"

Understanding packages in python:
The above cell **imports** certain "packages", which add particular functionality to the notebook. Each package has documentation that you can consult; it is generally good practice to refer to it for help with your code.

* Numpy Basics: https://numpy.org/doc/stable/user/absolute_beginners.html

* Data 8 Numpy manipulation guide: http://data8.org/su22/python-reference.html

* Pandas: https://pandas.pydata.org/docs/user_guide/index.html

* Pandas DataFrame manipulation: https://pandas.pydata.org/docs/reference/frame.html

* Matplotlib: https://matplotlib.org/stable/tutorials/introductory/usage.html

The following video provides an insight into the data science lifecycle, aka steps taken to analyze data and produce results.

In [None]:
## Do not edit this code block
!pip install pytube 
## an exclamation point at the start of a line runs that line as a terminal command, so you can install a package from the notebook
## in this case, we're installing the "pytube" package; the video itself is embedded by YouTubeVideo from IPython.display below

from IPython.display import YouTubeVideo
YouTubeVideo('5I2bYqeFQy4')

# **Part 1 - Data Structures in Python**

## String

In Python, a string is a sequence of Unicode characters. Strings can be created by enclosing characters in single quotes, double quotes, or triple quotes.

In [None]:
#All 4 strings are the same 
String = "Hello World" 
String = 'Hello World' 
String = """Hello World""" 
String = "Hello" + " " + "World"
print(String)

In [None]:
first_letter = String[0] # Indexing in Python starts at 0
last_letter = String[-1] # Negative indices count from the end, so -1 is the last character
for letter in String: # This is a for loop.
  print(letter)       # What we are saying is: "FOR every letter (or element) in String,
                      # DO X". In this case, X is "print" each "letter".
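
Indexing extends naturally to slicing, which pulls out a range of characters rather than a single one. A minimal sketch (the variable name `greeting` is chosen only for illustration):

```python
greeting = "Hello World"

first_word = greeting[0:5]    # characters at positions 0 through 4
last_word = greeting[-5:]     # the last five characters
length = len(greeting)        # number of characters, including the space

print(first_word, last_word, length)
```

The end position of a slice is excluded, which is why `greeting[0:5]` stops just before the space.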

## List
Python **lists** are indexable data structures which can hold different kinds of data as **items**. They may look something like the following:

```
friends = ['Rachel', 'Chandler', 'Joey', 'Monica', 'Ross', 'Phoebe']

odd_nums = [1, 3, 5, 7, 9]
```

To retrieve an element in a list, you would need to index into this list, starting with an index value of 0. For example, if you wanted to return the second integer in *odd_nums*, you would do the following:

In [None]:
odd_nums = [1, 3, 5, 7, 9]
odd_nums[1]

**Question 1: Replace the ellipses and try to return 'Joey' from the `friends` list**

In [None]:
friends = ['Rachel', 'Chandler', 'Joey', 'Monica', 'Ross', 'Phoebe']
...

A **NumPy array** is another data structure we commonly use in Python. It works similarly to a **list**, as shown below. 

In [None]:
array = np.array(["Appeals" ,"Supreme", "Court" , "Justice"])
#Prints the first and second item of the array 
print(array[0])
print(array[1])
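
Although indexing looks the same, arrays and lists behave differently under arithmetic. The sketch below (with made-up variable names) shows that multiplying an array applies to every element, while multiplying a list repeats it:

```python
import numpy as np

nums_list = [1, 3, 5]
nums_array = np.array([1, 3, 5])

doubled_array = nums_array * 2   # element-wise: each value is multiplied by 2
repeated_list = nums_list * 2    # list repetition: the list is joined to a copy of itself

print(doubled_array)
print(repeated_list)
```

This element-wise behavior is the main reason arrays are preferred for numerical work.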

## Dictionary
The **dictionary** data structure is useful for holding **pairs of data**, known as **"key-value"** pairs. Dictionaries look like 

```
Dictionary = { key : value , key : value , key : value}
```

You might notice that unlike lists that use `[]`, dictionaries use `{}` followed by a **key : value pair**. Each key-value pair is separated by a comma.

For example, if we wanted to create a dictionary called "states" where:
* the **key is the name of the state** 
* the **value is the state abbreviation**.


we would use the following code:

In [None]:
states = {"California" : 'CA',
          "Idaho" : 'ID',
          "Nevada" : 'NV'}
states

You can **access** the abbreviations by indexing into the dictionary with brackets and the key value. For example, if you want to return `VALUE` associated with `KEY`, you would do the following:

```
example_dict[KEY]
```

This would return `VALUE`.
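
Beyond bracket lookup, a few other dictionary operations come up often. The following is a small sketch on a hypothetical `capitals` dictionary (not part of the lab data):

```python
capitals = {"California": "Sacramento", "Idaho": "Boise"}  # hypothetical example data

capitals["Nevada"] = "Carson City"            # add a new key-value pair
missing = capitals.get("Oregon", "unknown")   # .get() returns a default instead of raising KeyError

for state, capital in capitals.items():       # loop over key-value pairs
    print(state, "->", capital)
```

Using `.get()` is a convenient choice when you are not sure a key exists; plain brackets raise a `KeyError` for a missing key.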

**Question 2: How would you return the abbreviation for Nevada using the states dictionary above? Assign *result* to this expression.**

In [None]:
# Using the states dictionary above, assign result to 'NV' by indexing with the key "Nevada"
result = states["Nevada"]
result

# Loading and Inspecting Data with Pandas
First, we will use the pandas package to read in two different file formats: CSV and JSON. Both of these files are "sample data" from Google Colab.

The functions **read_csv** and **read_json** take the **file path** of the file you wish to read as their first argument. In this case, the files are in a directory called "data".

In [None]:
df = pd.read_csv("data/california_housing_train.csv")
df2 = pd.read_json("data/anscombe.json")

Now that we have loaded our DataFrames, we want to see our new table. Here are a few useful ways to inspect it. 

* We can use the **`head()`** method to see the first 5 rows of our table. Alternatively, we can use `tail()` to see the last 5 rows. 
* The **shape** attribute is a tuple where the first item is the number of rows and the second is the number of columns in the table. 

```
df.head()
df.tail()
df.shape
```

* Now that we know what our table looks like and its shape, we can use the **`describe()`** method to see basic summary statistics for our table. 

```
df.describe()
```

In [None]:
# Show first 5 rows of our data
df.head()

In [None]:
# Show last 5 rows of our data
df.tail()

In [None]:
rows , columns = df.shape[0] , df.shape[1]
print("The number of rows is: " + str(rows), "  The number of columns is: " + str(columns))

In [None]:
df.describe()

## Accessing Values

Often, it is useful to access only the rows, columns, or values related to our analysis. We'll look at several ways to cut down our table into smaller, more digestible parts.

Let's say we wanted to grab only a few rows of this DataFrame, such as the first _three_. We can do this by using the **`iloc`** indexer; it takes a list or range of integer positions and creates a new DataFrame with the rows at those positions. Remember that in Python, indices start at 0! Below are a few examples:

In [None]:
df.iloc[[1, 3, 5]] # Takes rows with indices 1, 3, and 5 (the 2nd, 4th, and 6th rows)

In [None]:
df.iloc[[7]] # Takes the row with index 7 (8th row)

In [None]:
df.iloc[np.arange(7)] # Takes the rows with indices 0, 1, ..., 6
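
The difference between positional and label-based selection is easiest to see on a table whose row labels are not 0, 1, 2, and so on. A small sketch on a hypothetical toy DataFrame: `iloc` selects by integer position, while `loc` selects by label.

```python
import pandas as pd

# Hypothetical toy table with string row labels instead of 0, 1, 2, ...
toy = pd.DataFrame({"rooms": [3, 5, 2], "age": [10, 40, 25]},
                   index=["a", "b", "c"])

by_position = toy.iloc[[0, 2]]    # rows at positions 0 and 2
by_label = toy.loc[["a", "c"]]    # rows whose labels are "a" and "c"

print(by_position.equals(by_label))
```

With the default integer index used by the housing table, positions and labels coincide, which is why the two are easy to confuse.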

Similarly, we can also choose to display certain columns of the DataFrame. There are two ways to accomplish this, and both accept lists of column labels (or, via `df.columns`, column indices):
- Pass the names of the columns as a list when indexing the DataFrame to keep only those columns.
- The **`drop`** method creates a new DataFrame with all columns _except_ those indicated by the parameters (i.e. the listed columns are dropped).

Some examples:

In [None]:
df.loc[:, ["housing_median_age", "total_rooms"]].head() # Selects only "housing_median_age" and "total_rooms" columns

In [None]:
df.drop(df.columns[[0, 1]], axis=1).head() # Drops the columns with indices 0 and 1

In [None]:
df.iloc[[1,2,3,5], [3,5]] # Select only the columns with indices 3 and 5, 
                          # and only the rows with indices 1, 2, 3, 5

To make sure you understand the `loc`, `iloc`, and `drop` functions, try selecting the columns from "total_bedrooms" to "median_house_value" with only the first 3 rows:

In [None]:
## YOUR CODE HERE
df.iloc[0:3, 4:9] # rows 0-2, columns "total_bedrooms" (index 4) through "median_house_value" (index 8)

Finally, `loc` can also take a condition instead of explicit row labels, so that only the rows matching the condition are selected. The condition is built from:
- A column label
- A condition that each row should match

In other words, we select rows like so: `DataFrame_name.loc[DataFrame_name["column_name"] filter]`.



Here are some examples of selection:

The variable `median_house_value` indicates the median house value in an area. The query below will find all rows (areas) where the house value is exactly 90000.

In [None]:
df.loc[df["median_house_value"] == 90000]

The variable `population` corresponds to the population of an area. With the following filter, we'll find the rows where the population is between 1 and 99 (i.e. sparsely populated areas); note that `np.arange(1, 100)` stops before 100.

In [None]:
df_population = df.loc[df["population"].isin(np.arange(1, 100))]
df_population
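
Range filters like this can also be written with comparison operators, and several conditions can be combined. The sketch below uses a hypothetical toy table in place of the housing data:

```python
import pandas as pd

# Hypothetical toy data standing in for the housing table
toy = pd.DataFrame({"population": [50, 120, 99, 3000],
                    "median_house_value": [90000, 150000, 500000, 90000]})

# between() is inclusive on both ends by default
sparse = toy.loc[toy["population"].between(1, 99)]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
sparse_and_pricey = toy.loc[(toy["population"] < 100) & (toy["median_house_value"] > 100000)]

print(sparse)
print(sparse_and_pricey)
```

Unlike `isin` with a list of integers, `between` also matches non-integer values inside the range.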

## Sorting

It can be very useful to sort our DataFrames according to some column. The `sort_values` method does exactly that; it takes the column (or list of columns) that you want to sort by. By default, `sort_values` sorts the table in _ascending_ order of the data in the column indicated; however, you can change this by setting the optional parameter `ascending=False`.

Recall that we created a `df_population` dataset above which contained areas with small population. Let's sort the values by the `median_house_value` column to see sparsely populated areas with expensive houses.   

In [None]:
df_population.sort_values(by=['median_house_value']) # Sort table by median house value in ascending order
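
To put the most expensive areas first, the same call can be made with `ascending=False`. A minimal sketch on hypothetical toy data:

```python
import pandas as pd

toy = pd.DataFrame({"area": ["x", "y", "z"],          # hypothetical toy data
                    "median_house_value": [200000, 500000, 90000]})

most_expensive_first = toy.sort_values(by="median_house_value", ascending=False)
print(most_expensive_first)
```

The original `toy` table is left unchanged; `sort_values` returns a new, sorted DataFrame.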

## Manipulating the DataFrame
Next we will cover dropping unwanted values and duplicate rows using **`dropna()`** and **`drop_duplicates()`** respectively. Both of these functions return a new DataFrame without changing the original by default. 

In order to keep the new table, you will have to **assign it to a variable**. This is the default behavior for most pandas methods.

In [None]:
new_df = df.dropna()
new_df.head()

In [None]:
print('The number of rows after dropping N/A values is', new_df.shape[0])

In [None]:
new_df = new_df.drop_duplicates()
new_df.head()

In [None]:
print('The number of rows after dropping duplicate values is',new_df.shape[0])

Hooray! Seems like our data doesn't have any N/A or duplicate values. Then we can proceed to explore more of our dataset. 
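
Because the housing data has nothing to drop, the effect of these two methods is easier to see on a small table that does contain a missing value and a repeated row. The table below is hypothetical:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"rooms": [3, 3, np.nan, 5],      # hypothetical toy data
                    "age": [10, 10, 25, 40]})

no_missing = toy.dropna()                  # drops the row containing NaN
no_repeats = no_missing.drop_duplicates()  # drops the repeated (3, 10) row

print(len(toy), len(no_missing), len(no_repeats))
```

Each step returns a new DataFrame, so `toy` itself still has all four rows.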

One way to manipulate DataFrame tables is to use **`df['column name']`** to return one column of the table. 

In the example below we look at the **housing_median_age** column and use **`value_counts()`** to see how many times each unique value appears. Similarly, you can also apply other functions to columns. 

Adding new columns also uses this **`["column name"]`** syntax. You can specify **`df["column name"]`** and set it equal to the data you want to add. For example, if `df` had a `names` column and you wanted to add a column of those names in all upper-case letters, you could write:
```
df["upper_case_names"] = df["names"].str.upper() 
```

In [None]:
df['housing_median_age'].value_counts()

**Question 3: How would you add a column for log population? Replace ... with your answer. (Hint use np.log())**

In [None]:
df["log_population"] = ...
df

You can also use 

```
df[df["column"] == Condition]
```
to only keep rows where a certain condition is met.

In [None]:
# Example: Only keep rows where population is above the average
df[df["population"] > np.mean(df["population"])]

If you want to select more than one column at a time and/or a certain number of rows you can use 


```
df.loc[: , ['column_name' , 'column_name']]
```

where the first argument is the row index (or indices) you want and the second argument is the list of columns you want. 

The " : " after **`.loc[`**  is shorthand for **all**. This example gives you all the rows for the two columns. 
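
As a concrete illustration of selecting rows and columns together, here is a sketch on a hypothetical toy table (the column names mirror the housing data, but the values are made up):

```python
import pandas as pd

toy = pd.DataFrame({"population": [50, 120, 99],      # hypothetical toy data
                    "households": [20, 45, 30],
                    "median_income": [2.5, 4.1, 3.3]})

all_rows = toy.loc[:, ["population", "median_income"]]     # every row, two columns
some_rows = toy.loc[0:1, ["population", "median_income"]]  # label slices include the end label

print(all_rows.shape, some_rows.shape)
```

Note that `loc` slices are inclusive of the end label, unlike `iloc` slices and ordinary Python slices.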

## Summary of pandas functions

As a summary, here are some useful functions that you can use with pandas.
    
|Name|Example|Purpose|
|-|-|-|
|`DataFrame`|`pd.DataFrame()`|Create an empty DataFrame, usually to extend with data|
|`pd.read_csv`|`pd.read_csv("my_data.csv")`|Create a DataFrame from a data file|
|`pd.DataFrame({})`|`df = pd.DataFrame({"N": np.arange(5), "2*N": np.arange(0, 10, 2)})`|Create a DataFrame from a dictionary of columns|
|`loc`|`df.loc[df["N"] > 10]`|Create a copy of a DataFrame with only the rows that match some *predicate*|
|`loc`|`df.loc[:, ["N"]]`|Create a copy of a DataFrame with only specified column names|
|`(subsetting)`|`df[["N"]]`|Another way to create a copy of a DataFrame with only specified column names|
|`iloc`|`df.iloc[np.arange(0, 6, 2)]`|Create a copy of the DataFrame with only the rows whose indices are in the given array|
|`sort_values`|`df.sort_values(by=["N"])`|Create a copy of a DataFrame sorted by the values in a column|
|`index`|`len(df.index)`|Compute the number of rows in a DataFrame|
|`columns`|`len(df.columns)`|Compute the number of columns in a DataFrame|
|`drop`|`df.drop(columns=["2*N"])`|Create a copy of a DataFrame without some of the columns|


# Plotting and visualizations with DataFrames
## Histograms
Histograms are a nifty way to display quantitative information. The x-axis is typically a quantitative variable of interest, and the y-axis is generally a frequency. Plot a histogram of the median house value, and then experiment with the bin sizes by uncommenting the lines below (removing the # symbol).

In [None]:
df.hist("median_house_value", grid = False)

# Try uncommenting the following lines and see the different results. What does this tell you about the parameters used?

#df.hist("median_house_value",bins = range(0,500000,1000) ,grid = False)
#df.hist("median_house_value" ,grid = False)
#df.hist("median_house_value",grid = True)
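
To see what the `bins` parameter does without drawing anything, you can count the values per bin directly. The sketch below uses `np.histogram` on made-up numbers rather than the housing data:

```python
import numpy as np

values = np.array([1, 2, 2, 3, 8, 9, 9, 10])   # hypothetical toy data

counts_coarse, edges_coarse = np.histogram(values, bins=2)  # two wide bins
counts_fine, edges_fine = np.histogram(values, bins=5)      # five narrower bins

print(counts_coarse)
print(counts_fine)
```

With two bins the data looks evenly split; with five bins the gap in the middle becomes visible. The same trade-off applies when you change `bins` in `df.hist`.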

## Scatter Plots
Scatter plots are generally used to relate two variables to one another. They can be useful when trying to infer relationships between variables, visualize simple regressions, and get a general sense of the "spread" of your data.

In [None]:
# Median house value vs Median income. Do you spot a correlation?
df.plot.scatter(x = "median_income", 
                y = "median_house_value")

# **Part 2 - Introduction to API**

For LS190, we'll be working extensively with the **[case.law](https://case.law/)** database, which is a database of **360 years of United States caselaw.** To access this data we'll need to develop a simple understanding of APIs.

* API is the acronym for **Application Programming Interface** - a pretty vague acronym, if you ask me (unless you are a CS person who can explain what this means). The term itself most likely comes from the early days of computing. 

* For the purposes of this class, the important thing to note is that an **API allows us to interact with, access and download data from the case.law database.** This is why a **[case.law API KEY](https://case.law/docs/site_features/api)** becomes important - as this key allows you to download the case.law data. You can think of an API key as a magical phrase like "Open Sesame" - which lets you access a database where hidden treasures of data await.

* The overwhelming amount of data available has made **APIs and API KEYS** the standard means of accessing data. Different organizations which store data have different means of accessing it. For example, there's the **[Twitter API](https://developer.twitter.com/en/docs/twitter-api)** if you want to study tweets. Or **[NYTimes API](https://developer.nytimes.com/apis)** if you want to study the New York Times Archive.

The API for case.law is very well documented and you can find examples of how to use the API by following the various **[jupyter notebooks they provided](https://github.com/harvard-lil/cap-examples)**. 

In [None]:
import os
import sys
sys.path.append('..')

import lzma
import json

from config import settings_base as settings 
import utils

Above, we are importing a couple of libraries. 
* **`lzma`** allows us to decompress the case.law data
* **`json`** allows us to parse JSON text into Python's *dictionary* data structure 
* **`config`** is a folder which contains the **settings_base** Python script. This script should contain your **API KEY** 
  * Note that each API key is unique. Because these notebooks are published on GitHub, we cannot share the API key. This is why I ask you to get an API key from the kind folks at case.law as soon as possible.
* Finally, **`utils`** is a Python script which contains helper functions written by the case.law folks which allow us to download their data.

The example code below is based on the **[example notebooks](https://github.com/harvard-lil/cap-examples)** written by the wonderful people working at case.law. The case.law project is incredibly important as it allows us to access **huge amounts of case text data** without having to pay for a subscription to services like LexisNexis or Westlaw. This is a wonderful example of data democratization.

In [None]:
# Get Case Data for Hawaii (as it's a small-ish jurisdiction)
compressed_file = utils.get_and_extract_from_bulk(jurisdiction="Hawaii", 
                                                  data_format="json")

In [None]:
# Assume we are dealing with json data (if data_format is changed to xml,
# change this cell's os.path.join line accordingly)
if not compressed_file.endswith('.xz'):
  compressed_file = os.path.join(compressed_file, "data", "data.jsonl.xz") 

In [None]:
cases = []
print("File path:", compressed_file)
with lzma.open(compressed_file) as infile:
    for line in infile:
        record = json.loads(str(line, 'utf-8'))
        cases.append(record)

print("Case count: %s" % len(cases))
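
If you do not yet have an API key, the reading pattern above can still be tried on a file you create yourself. The following self-contained sketch writes two hypothetical records to a temporary `.jsonl.xz` file (one JSON object per line, xz-compressed) and reads them back:

```python
import json
import lzma
import os
import tempfile

# Hypothetical records standing in for real case data
records = [{"id": 1, "name": "Case A"}, {"id": 2, "name": "Case B"}]

path = os.path.join(tempfile.mkdtemp(), "data.jsonl.xz")

# Write one JSON object per line into an xz-compressed file
with lzma.open(path, "wt", encoding="utf-8") as outfile:
    for record in records:
        outfile.write(json.dumps(record) + "\n")

# Read it back line by line; text mode ("rt") decodes each line,
# so no explicit str(line, 'utf-8') conversion is needed
loaded = []
with lzma.open(path, "rt", encoding="utf-8") as infile:
    for line in infile:
        loaded.append(json.loads(line))

print(loaded == records)
```

Reading line by line keeps memory use low, which matters for jurisdictions much larger than Hawaii.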

In [None]:
df = pd.DataFrame(cases)
df.head()

We now have access to the case.law data for Hawaii. Fascinating!

You can explore the data further here if you'd like. Note that the actual **text** is contained within a dictionary in the column named **casebody**. We will explore how to access the text of decisions below, when we introduce **concordances**.

Note:
I also, again, encourage you to check out the case.law **[example notebooks](https://github.com/harvard-lil/cap-examples)**. For instance, the **[Cartwright notebook - which shows who was Illinois' most prolific judge](https://github.com/harvard-lil/cap-examples/blob/develop/bulk_exploration/cartwright.ipynb)** is pretty fascinating.

# **Part 3 - Introduction to some simple NLP Tasks**

## Spacy

**[spacy](https://spacy.io/usage/spacy-101#features)** is an open-source NLP library which we'll be using to analyze features about textual data.

In [None]:
import spacy

text = "Today, I had a great time visiting Disneyland!"

# Loads a language-processing pipeline - 
# here a trained model for the English language called "en_core_web_sm"
nlp = spacy.load("en_core_web_sm")

# Runs the pipeline on the string stored in the variable "text"; the result is a sequence of tokens
pipe = nlp(text)

for token in pipe:
  print(token)

Python's `split()` method is similar, but you'll notice that it does not separate words from punctuation, unlike spaCy's tokenization.

In [None]:
# splitting on whitespace
text.split(' ')
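
To make the contrast concrete without loading a model, the sketch below compares whitespace splitting with a rough regular-expression tokenizer. This is only an approximation written for illustration, not spaCy's actual algorithm; regular expressions are covered later in this lab.

```python
import re

sentence = "Today, I had a great time visiting Disneyland!"

whitespace_tokens = sentence.split(" ")
# \w+ matches a run of word characters; [^\w\s] matches a single punctuation mark
rough_tokens = re.findall(r"\w+|[^\w\s]", sentence)

print(whitespace_tokens)
print(rough_tokens)
```

Whitespace splitting leaves "Today," and "Disneyland!" with punctuation attached, while the second approach separates the comma and exclamation point into their own tokens.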

**spaCy** is great because it makes it easy to retrieve various attributes of the tokens within your text. We run a model that's already available, namely "en_core_web_sm".

For example, say that for every token you wanted to know the **part of speech**: is it a verb or a noun?

In [None]:
token_pos_pairs = []

for token in pipe:
  token_pos_pairs.append([token, token.pos_])

token_pos_pairs

The "lemma" of a word can be considered its base form. For example, *eating*, *ate*, and *eat* all have the same lemma: *eat*.

**Question 5: Using the for loop above as inspiration, create a list `token_lemma_pairs`. This list will be the same shape as token_pos_pairs, but contain the lemma rather than part-of-speech.**

In [None]:
token_lemma_pairs = []

for token in pipe:
  ...

# The shape of token_lemma_pairs should match token_pos_pairs in the last coding cell
token_lemma_pairs

Let's say now, we want to create a dictionary with the keys being terms within a text, and values being the number of times the terms are present. In this case, tokenization with spacy can help!

In [None]:
text = "Hey, what'd you think about the new movie?"
nlp = spacy.load("en_core_web_sm")
pipe = nlp(text).to_array('LANG')
pipe

In [None]:
# Replace the 4 ellipses in the for-loop

# word_counts is a dictionary with tokens as the keys and counts as the values
word_counts = {}

# Creates a pipeline of tokens for our text
text = "hi hi hi hi hi hey hello hey" #REPLACE WITH A FILE!!!! This is just a test
nlp = spacy.load("en_core_web_sm")
pipe = nlp(text)

tokens = []

for token in pipe:
  tokens.append(str(token))

# word_counts[KEY] accesses a VALUE (count) associated with a KEY (token/word)
for token in tokens:
  # If the word already exists in the dictionary, what should you do to its count? 
  if token in word_counts:
    word_counts[token] += 1
  # What if it doesn't exist?
  else:
    word_counts[token] = 1

word_counts
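
The counting loop above is worth writing by hand once, but Python's standard library offers a shortcut. A minimal sketch using `collections.Counter`, with whitespace splitting standing in for spaCy tokenization:

```python
from collections import Counter

tokens = "hi hi hi hi hi hey hello hey".split()

word_counts = Counter(tokens)          # counts every token in one step
top_two = word_counts.most_common(2)   # the two most frequent tokens with their counts

print(word_counts)
print(top_two)
```

A `Counter` behaves like a dictionary, so `word_counts["hi"]` works just as it does with the hand-built version.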

# REGEX - Regular Expressions

**Sets, Quantifiers, and Special Characters**

Regex (regular expressions) is a very powerful tool to find patterns in text. One of the best ways to learn Regex is by using Regex 101 to practice matching words in a body of text.

[Regex101](https://regex101.com/)

[Regex Reference Sheet](http://www.rexegg.com/regex-quickstart.html#ref)

For example, say we had a text and we wanted to find every instance of a word within that text.

In [None]:
# Just run this cell
import regex as re
text = "Samuel and I went down to the river yesterday! Samuel isn't a very good swimmer, though. Good thing our friend Sahit was there to help."

# the findall() function finds every instance of a specified word pattern within a text
re.findall(r'Samuel', text)

Let's say that instead of only wanting to find Samuel, we wanted to find every word in the text starting with 'Sa'. What would we do? Use pattern matching!

In [None]:
re.findall(r'Sa[a-z]*', text)

You may be wondering what the [a-z] in the Sa[a-z]* pattern means. This is called a **set** in regex. A set such as [abcde] matches any one of the characters inside it. Regex also has a shorthand for ranges: [a-z] means the same thing as [abcde...xyz].

Here are some more:
~~~ 
[0-9]        any numeric character
[a-z]        any lowercase alphabetic character
[A-Z]        any uppercase alphabetic character
[aeiou]      any vowel (i.e. any character within the brackets)
[0-9a-z]     to combine sets, list them one after another 
[^...]       exclude specific characters
~~~


You still may be wondering how the entirety of Sahit could be matched if a set like [a-z] matches only one character. The answer is something called a **quantifier**!

Rules:
~~~ 
*        0 or more of the preceding character/expression
+        1 or more of the preceding character/expression
?        0 or 1 of the preceding character/expression
{n}      n copies of the preceding character/expression 
{n,m}    n to m copies of the preceding character/expression 
~~~
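
The rules above can be checked on a short made-up string. This sketch uses Python's standard-library `re` module, which supports the same `findall` call used in this lab:

```python
import re

sample = "a ab abb abbb"

star = re.findall(r"ab*", sample)         # 'a' followed by zero or more 'b'
plus = re.findall(r"ab+", sample)         # 'a' followed by one or more 'b'
optional = re.findall(r"ab?", sample)     # 'a' followed by at most one 'b'
exactly_two = re.findall(r"ab{2}", sample)    # 'a' followed by exactly two 'b'
two_to_three = re.findall(r"ab{2,3}", sample) # 'a' followed by two or three 'b'

for name, result in [("*", star), ("+", plus), ("?", optional),
                     ("{2}", exactly_two), ("{2,3}", two_to_three)]:
    print(name, result)
```

Notice that `ab{2}` still finds a match inside "abbb": it matches the first three characters and ignores the extra "b".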

Say that now, you only wanted to return Samuel when the name was mentioned at the beginning of the text.

In [None]:
re.findall(r'^Samuel', text)

**Special characters**, such as the *^* used in the pattern above, have a special meaning in a pattern. Some of them match a position rather than a character: *^* matches the subsequent pattern only if it is at the beginning of the string. This is why only a single 'Samuel' was returned.

Rules:
~~~ 
.         any single character except newline character
^         start of string
$         end of entire string
\n        new line
\r        carriage return
\t        tab

~~~

**Python RegEx Methods**

* `re.findall(pattern, string)`: Returns all phrases that match your pattern in the string.

* `re.sub(pattern, replacement, string)`: Return the string after replacing the leftmost non-overlapping occurrences of the pattern in string with replacement

* `re.split(pattern, string)`: Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. 
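
The three methods can be compared side by side on a shortened version of the earlier example sentence. This is a sketch using the standard-library `re` module:

```python
import re

sentence = "Samuel and I went down to the river yesterday! Samuel isn't a very good swimmer."

names = re.findall(r"Sa[a-z]*", sentence)      # every word starting with 'Sa'
renamed = re.sub(r"Samuel", "Sam", sentence)   # replace each match with 'Sam'
parts = re.split(r"[!.]\s*", sentence)         # split on '!' or '.' plus any following spaces

print(names)
print(renamed)
print(parts)
```

Because the sentence ends with a period, `re.split` returns an empty string as the last item; filter it out if you only want the non-empty pieces.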

# **Part 4 - Concordances and Collocation**

## Concordances
A concordance view shows us every occurrence of a given word, together with some context. Concordances are fundamentally important if we want to understand the meaning of a word in a context.

Here we look up the word "petition" in the casebody of the Hawaii dataset by typing `text`, followed by a period, then `concordance`, and then placing "petition" in parentheses.


In [None]:
import nltk
nltk.download('punkt')
from nltk.text import Text
from nltk.tokenize import word_tokenize # import tokenizer function from nltk


Recall that we created a dataframe called **df** which consists of downloaded cases from Hawaii. 
We will now try to access the text of those decisions.

In [None]:
case_body = df['casebody'][0]['data']['opinions'][0]['text'] # this expression looks into the casebody column,
                                                             # takes the first element in it, aka [0], then 'data', then 'opinions',
                                                             # then the first element again [0], then 'text'
                                                             # basically - it's a complex data structure - 
                                                             # a dictionary with a list with a dictionary with a list!
case_body[:10000] ## show the first 10000 characters of this text

In [None]:
case_body_tokenized = word_tokenize(case_body)
text = Text(case_body_tokenized)
text.concordance("petition")
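
To get a feel for what a concordance does, here is a simplified pure-Python sketch. The function `simple_concordance` is hypothetical and written only for illustration; it is not how NLTK implements `concordance`, and the token list is a made-up sentence rather than a real opinion.

```python
def simple_concordance(tokens, target, window=3):
    """Return each occurrence of target with `window` tokens of context on either side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Hypothetical sentence standing in for a tokenized opinion
tokens = "The petition was denied . A second petition was filed later".split()
for line in simple_concordance(tokens, "petition", window=2):
    print(line)
```

Each output line shows one occurrence of the target word in brackets, surrounded by its neighbors, which is the core idea behind the concordance view.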

**Question 6: write your own code to explore the occurrence of the word *court***

In [None]:
...

## Collocation
Collocations are expressions of multiple words which commonly co-occur. For example, the top 20 bigram collocations in the case body extracted above are listed below, as measured using **Pointwise Mutual Information** (PMI). 

[A pretty good explanation of PMI](https://stats.stackexchange.com/a/522504) is given on **stackexchange.**

Note: stackexchange (and google generally) is a wonderful resource for all things relating to NLP and statistics. Although for this course, I do not emphasize concrete statistical knowledge or mathematical formulas, you should still get some **intuitive** understanding of what these measures like PMI do. 
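
For intuition, PMI compares how often two words appear together with how often they would appear together by chance: PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ). The sketch below computes it by hand on a made-up sentence. It is a simplified illustration that estimates every probability as a count divided by the number of tokens, and it is not a substitute for NLTK's implementation.

```python
import math
from collections import Counter

# Hypothetical toy corpus
tokens = "the supreme court said the court of appeals and the court agreed".split()

word_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))   # adjacent word pairs
n = len(tokens)

def pmi(w1, w2):
    """log2 of how much more often (w1, w2) co-occur than chance would predict.

    Assumes the bigram occurs at least once in the corpus.
    """
    joint = bigram_counts[(w1, w2)]
    return math.log2(joint * n / (word_counts[w1] * word_counts[w2]))

for pair in [("of", "appeals"), ("supreme", "court"), ("the", "court")]:
    print(pair, round(pmi(*pair), 3))
```

The pair "of appeals" scores highest because those two words never appear apart, while "the court" scores lowest even though it is the most frequent pair, since both words are common on their own. This also hints at a known weakness of PMI: it tends to favor rare words.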


In [None]:
from nltk.collocations import *
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()
fourgram_measures = nltk.collocations.QuadgramAssocMeasures()

In [None]:
bigram_finder = BigramCollocationFinder.from_words(case_body_tokenized)
bigram_finder.nbest(bigram_measures.pmi, 20)

Do you see any interesting patterns that emerge from the dataset?

**Question 7: Using the methods defined above, find the top 10 trigram and fourgram collocations.**

In [None]:
# Trigram
...

In [None]:
# FourGram
...

Congratulations! You have finished Lab 1!