# Examples

In this Jupyter Notebook we will go through three examples, each using a different type of data (freeform text from a novel, spreadsheet population data, and spatially structured collision data from New York City, USA), to highlight different possibilities for using Python in research. These examples are not exhaustive. They illustrate key ideas and give you practice reading code (some of which is completely new!), using comments and documentation to make sense of it, and then potentially taking the next step of modifying the code to do something else. Please note that the instructional materials intentionally did not cover everything you will see in this notebook, because part of the learning is 'learning to learn': working through new types of Python code that you might see in tutorials or other research examples.

## What you will learn in this notebook: 
1. Read and Analyze Text Files
2. Read, Filter, and Visualize Comma-Separated Values (CSV) files from Excel
3. Map spatial data in a Comma-Separated Values (CSV) file from Excel
4. Reading new code including new data structures (e.g., dictionaries and lists) and new techniques (e.g., for loops)
5. Making small adjustments and modifications and re-running code to test if it works (i.e., don't change everything at once)

In short, this exercise walks you through examples and reminds you where to look for help, so that you can start learning how to make sense of unfamiliar pieces of code and potentially use them to advance your own research.

In [None]:
# But first, this might be handy...
# Want to learn more about a certain function? Try help()!

# For example, if you want to learn more about the read_file function in the geopandas package you can simply type the following
# Try running this code to see what happens.
import geopandas

help(geopandas.read_file) 

## 1. Moby Dick, Read and Analyze Text

This example illustrates how to analyze freeform text. The book, an epic tale of man versus nature, comes from Project Gutenberg (see links below).

 * HTML version (for humans) - https://www.gutenberg.org/files/2701/2701-h/2701-h.htm
 * Text version (for computers) - https://www.gutenberg.org/files/2701/2701-0.txt

In [None]:
# This code segment reads the first 1,000 lines of Moby Dick and prints them to the screen.

file_text = open("moby-dick-text.txt", "r", encoding='utf-8') # Open Moby Dick from Project Gutenberg
# Notice the 'utf-8'. This is called an 'encoding' which tells the computer what characters of text
# will be included. *If you work with non-English texts you will use different encodings for text.*
# This is important for social science and geography that often work outside of English only domains!

# We don't want to print everything, so set up a line counter that starts at 0
count = 0

# Only print the first 1,000 lines
stop = 1000

# This is a for loop! This is likely the first time you have seen a for loop.
# Take a minute to search for 'for loops in Python' to learn a bit more
# Then try to use that knowledge to read the next few lines to see what this code is doing, then run it to see if you are right.
# To learn even more about for loops and iteration in general take a look here: https://www.py4e.com/html3/05-iterations
for line in file_text: # Use the file text
    print(line, end="") # Print one line
    count = count + 1 # Add one to the counter
    if count >= stop: # Stop once 1,000 lines have been printed
        break # This is a reserved word that means quit (or break) the loop

file_text.close() # Close the file
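
A common alternative to the manual counter above is a `with` block combined with `itertools.islice`. The sketch below is not part of the original exercise. It writes a small stand-in file (the name `sample-text.txt` is made up for illustration) so that it runs without the Moby Dick download, and the helper name `read_first_lines` is also hypothetical.

```python
from itertools import islice
from pathlib import Path
import tempfile


def read_first_lines(path, limit):
    """Return at most `limit` lines from the start of a text file."""
    # 'with' closes the file automatically, even if an error occurs
    with open(path, "r", encoding="utf-8") as file_text:
        # islice stops reading after `limit` lines, so no counter is needed
        return [line.rstrip("\n") for line in islice(file_text, limit)]


# Build a small stand-in file so the example is self-contained
sample_path = Path(tempfile.gettempdir()) / "sample-text.txt"
sample_path.write_text(
    "Call me Ishmael.\nSome years ago\nnever mind how long\n",
    encoding="utf-8",
)

first_two = read_first_lines(sample_path, 2)
print(first_two)
```

To try it on the novel, pass `"moby-dick-text.txt"` and `1000` instead of the stand-in file and `2`.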

In [None]:
# Now let's learn how to install a new package.
# This package is called TextBlob
# We will use the Package Installer for Python (pip)
# The pip command for packages is listed below (the --user says that this package will be for you, the user, only)
# We use the "!" command to call the pip utility (note this is not Python, but Jupyter Notebook)
# Run this code cell first. You should see "Successfully installed textblob-#.##.#"
!pip install --user textblob

In [None]:
# This line is similar to import, but it imports only one name (the TextBlob class) from the textblob package
from textblob import TextBlob

###   WARNING!!!!!   ###
# You will likely get an error saying that a resource for nltk is missing
# nltk stands for "Natural Language Toolkit" and is very useful for processing text
# If so, open a new cell by clicking the "+" and run the commands that the error message tells you to run
# This is a common example of modules needing additional information/packages/data
# So this exercise will help you learn how to solve these small problems

file_text = open("moby-dick-text.txt", "r", encoding='utf-8') # Open Moby Dick from Project Gutenberg
# Notice the utf-8. This is called an 'encoding' which tells the computer what characters of text
# will be included. If you work with non-English texts you will use different encodings for text.

# Now that the file is open, we can read the file
moby_dick_text = file_text.read()

# Now that we have the string (moby_dick_text) we are done with the file
file_text.close() # So close it


# If you want to see what moby_dick_text looks like. Try printing it out

# Now we have the entire book stored in a variable (moby_dick_text) as a string.

# Let's try using the textblob module. For more information see the QuickStart guide
# https://textblob.readthedocs.io/en/dev/quickstart.html

# First, we will turn the string into a textblob using the imported TextBlob class
blob = TextBlob(moby_dick_text)

# We can do a few simple things like see how many words are in the average sentence
total_number_of_words = 0
count = 0

for sentence in blob.sentences:
    words_in_sentence = len(sentence.words) # Count the number of words in the sentence
    
    # Add the length of the sentence (number of words) to the total number_of_words count
    total_number_of_words = total_number_of_words + words_in_sentence

    count = count+1 # Add one to the count (for count of sentences)
    
print("The average number of words per sentence is:", total_number_of_words/count)
print("The number of sentences is:", count)

# This is one example of what textblob can do. For a more advanced example, scroll to the bottom of the notebook.
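
To see what the loop above is computing, here is a rough stand-in that uses only Python's built-in `re` module. It is a sketch, not TextBlob's method: it assumes sentences end with `.`, `!`, or `?` followed by whitespace, which is cruder than a real tokenizer, so its numbers will differ from TextBlob's. The function name `average_words_per_sentence` is made up for illustration.

```python
import re


def average_words_per_sentence(text):
    """Return (average words per sentence, number of sentences)."""
    # Rough sentence split: break after ., ! or ? when whitespace follows
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0, 0
    # Rough word split: runs of letters, digits, or apostrophes
    word_counts = [len(re.findall(r"[A-Za-z0-9']+", s)) for s in sentences]
    return sum(word_counts) / len(sentences), len(sentences)


sample = "Call me Ishmael. Some years ago, I went to sea! Why?"
average, sentence_count = average_words_per_sentence(sample)
print("Average words per sentence:", average)
print("Number of sentences:", sentence_count)
```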

## 2. American Community Survey Housing Data

This example shows how to read and visualize spreadsheet data from the American Community Survey. It uses a number of packages, including pandas (for reading and analyzing data) and matplotlib/seaborn (for visualizing data).

In [None]:
# Import a lot of packages
# This import style is using shorthand so that 'numpy' becomes 'np' in the code below. This saves typing for lazy developers.
# It is standard practice, so it is worth learning because you will see it in code examples.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read American Community Survey Housing Data into what is called a dataframe -> Think spreadsheet/table for Python
housing = pd.read_csv("ACS_17_1YR_DP04_with_ann.csv")

housing # Show the table
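
The cells in this section read and plot the data but do not filter it. As a minimal sketch of filtering a dataframe, the example below uses a tiny made-up table. The column names `total_units` and `total_rooms` and all the numbers are hypothetical and are not taken from the ACS file.

```python
import pandas as pd

# Made-up numbers for illustration only; these are not ACS values
toy_housing = pd.DataFrame({
    "area": ["A", "B", "C", "D"],
    "total_units": [1200, 450, 3100, 800],
    "total_rooms": [5400, 2100, 12900, 3300],
})

# A boolean mask: True for rows where the condition holds
is_large = toy_housing["total_units"] > 1000

# Keep only the rows where the mask is True
large_areas = toy_housing[is_large]

# Add a derived column, computed for every row at once
toy_housing["rooms_per_unit"] = toy_housing["total_rooms"] / toy_housing["total_units"]

print(large_areas)
print(toy_housing)
```

The same pattern should carry over to the `housing` dataframe once you pick a column name from the table above.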

In [None]:
# Plot the relationship between number of units and number of rooms (totals)

# Make sure our plots show up
%matplotlib inline 

sns.set(style="whitegrid") # This comment intentionally left blank
# Can you figure out what this does using web search?

# Remember sns stands for seaborn, so to learn more about scatterplot you can search 'seaborn scatterplot'
plot = sns.scatterplot(x="HC01_VC03", y="HC01_VC40", # x=total units, y=total rooms
                data=housing)


# Show the plot
plot

In [None]:
# Clean up the plot of total units versus total rooms by adding a title and axis labels
plt.title('Total units versus total rooms')

plot = sns.scatterplot(x="HC01_VC03", y="HC01_VC40", # x=total units, y=total rooms
                data=housing)

plt.xlabel("Total Units")
plt.ylabel("Total Rooms")

plot

In [None]:
# The last plot was boring, let's look at vacancies

plt.title('Estimated % owner versus rental vacancy rates')
plot = sns.scatterplot(x="HC01_VC08", y="HC01_VC09", # x=% owner vacancy rate, y=% rental vacancy rate
                data=housing)
plt.xlabel("Est. % Owner Vacancy Rate")
plt.ylabel("Est. % Rental Vacancy Rate")
plot

In [None]:
# Combine plots, called jointplot, and plot out regression line.
# See here for example: http://seaborn.pydata.org/examples/regression_marginals.html
# Notice that this is the same plot as above, but with slightly more information.
# Just because this is 'more' doesn't mean it is always 'better' (i.e., cooler looking isn't always more informative)

# Once you have this down, feel free to try changing the columns and creating your own plot.
# Look up at the table to see column names.

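# Recent versions of seaborn require x and y to be passed as keyword arguments
plot = sns.jointplot(x="HC01_VC08", y="HC01_VC09", data=housing,
                  kind="reg", # Regression line
                  color="m",  # Try "r" or "b" or "m"
                  height=8)   # Height controls the size of the graph, try 6 or 10 too.

# jointplot returns a JointGrid, so set the labels on the grid rather than with plt.xlabel/plt.ylabel
plot.set_axis_labels("Est. % Owner Vacancy Rate", "Est. % Rental Vacancy Rate")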

plot
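
The regression line drawn by `kind="reg"` is, by default, a straight line fitted by least squares. As a hedged sketch of what that fit computes, the example below uses `numpy.polyfit` on a few made-up points. The variable names and values are hypothetical, and the points lie exactly on a line so the result is easy to check by hand.

```python
import numpy as np

# Made-up values for illustration only; these are not ACS data
owner_rate = np.array([1.0, 2.0, 3.0, 4.0])
rental_rate = np.array([3.0, 5.0, 7.0, 9.0])

# Fit a degree-1 polynomial (a straight line) by least squares
slope, intercept = np.polyfit(owner_rate, rental_rate, 1)

# Use the fitted line to predict the y value at a new x value
predicted = slope * 5.0 + intercept

print("slope:", slope)
print("intercept:", intercept)
print("prediction at x=5.0:", predicted)
```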

## 3. Collision Data - Mapping Spatial Data

In this example you will use CSV spreadsheet data that contains latitude and longitude coordinates from crashes. You can then map these crashes in Python.

In [None]:
# Let's take a look at the data, which is New York City, NY, USA Vehicle Collisions
# Source of data: https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/data
# NYC, like many cities around the world, has lots of open data, which can be downloaded and analyzed.
# This is simply an example of collisions from the year 2020.

import geopandas

crashes = geopandas.read_file("Motor_Vehicle_Collisions_-_Crashes.csv")

crashes

In [None]:
import geopandas
import pandas
from shapely.geometry import Point
from geopandas import GeoDataFrame

# Read the CSV file using pandas, store as a dataframe named df
df = pandas.read_csv('Motor_Vehicle_Collisions_-_Crashes.csv')


# EXPERIMENT TIME
# Try commenting out the following two lines in different ways. See what changes, if anything, in the map.
# Why?!

df = df.dropna(subset=['LOCATION'])
df = df[df.LATITUDE != 0.0]

crs = "EPSG:4326" # Set the coordinate reference system to EPSG:4326, aka WGS84 (the older {'init': ...} form is deprecated)

# Warning, advanced technique you will see in examples.
# Learn to understand 'enough' of the code for it to work correctly, but don't feel like you need to master it every time
geometry = [Point(xy) for xy in zip(df.LONGITUDE, df.LATITUDE)] # This is called list comprehension, an advanced technique

# GeoDataFrame. That sounds cool. I want to learn more. Maybe I should run a search to learn more?!
geo_df = GeoDataFrame(df, crs=crs, geometry=geometry)

# Plot is cool.
# Can you change anything? What impact does it have?
geo_df.plot(marker='o', color='red', markersize=0.25)
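
The list comprehension with `zip` in the cell above can look dense the first time you see it. The sketch below unpacks the same pattern using plain Python lists and tuples, so it runs without geopandas or shapely. The coordinates are made up, and tuples stand in for `Point` objects.

```python
# Made-up coordinates; the last pair mimics a missing location stored as 0.0
longitudes = [-73.98, -73.95, 0.0]
latitudes = [40.75, 40.78, 0.0]

# List comprehension: zip pairs up the two lists, one (lon, lat) pair per row
pairs = [(lon, lat) for lon, lat in zip(longitudes, latitudes)]

# The equivalent for loop, written out step by step
pairs_loop = []
for lon, lat in zip(longitudes, latitudes):
    pairs_loop.append((lon, lat))

# A comprehension can also filter, here dropping the placeholder 0.0 latitude
valid_pairs = [(lon, lat) for lon, lat in zip(longitudes, latitudes) if lat != 0.0]

print(pairs)
print(valid_pairs)
```

The filter in the last comprehension plays the same role as `df[df.LATITUDE != 0.0]` in the cell above: a point at latitude 0, longitude 0 is nowhere near New York City.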



## 4. Your Research Story

Now that you have tried a few Python examples using actual data, let's try going back to your research story. Try to think about how you could use one of these examples to start writing Python code for your research.

Break down the (coding) problem:
1. What are the steps (guided by your research design)?
2. What are the datasets that will need to be opened?
3. What format are the data and what do you want to do with the data?
4. What packages might help you? (e.g., if it is spreadsheet/Excel data, then pandas might be appropriate)
   * I didn't see any examples that fit my research!
   * Python has many packages, take a look below at a few popular packages, but know there are many, many more.
5. What output or visualizations do you want? (e.g., seaborn could help with plots)

Take it one step at a time by coding one piece at a time. Start by opening up your data. See if you need to clean it up (e.g., Example 3 required us to filter out some data). Learn how to clean the data if it is too dirty to analyze at first. Look up how to conduct your first analysis. Keep going, just one step at a time.

### Even more packages
Python has so many wonderful packages. A brief list of popular and common ones for science is given below.

  * scipy
  * numpy
  * matplotlib (visualization)
  * bokeh (visualization)
  * seaborn (visualization)
  * nltk (linguistics)
  * TensorFlow/Keras (machine/deep learning - sorry not today ...)
  * pandas
  * scrapy (scraping information off the internet)
  * gensim (topic modeling, document indexing)
  * statsmodels (statistics and modeling)
  * folium (mapping)
  * gmaps (Google Maps)
  * geoplot (more mapping)
  * pysal (spatial analysis)
  * ...
  
  
### Python also has "Scientific Kits" or scikit
Entire list here: http://scikits.appspot.com/scikits

 * scikit-learn : machine learning : https://scikit-learn.org/stable
 * scikit-network : networking analysis (newer) : https://scikit-network.readthedocs.io/en/latest/
 * scikit-image : image processing : https://scikit-image.org/


## Try-It - Extras

In [None]:
# How many times is the word 'whale' mentioned in Moby Dick?
# This cell walks through an extended example of what is possible with Python.

# We will use the TextBlob package.
# A nice walkthrough with more capabilities than this example can be found here:
# https://stackabuse.com/python-for-nlp-introduction-to-the-textblob-library/


# This line is similar to import, but it imports only one name (the TextBlob class) from the textblob package
from textblob import TextBlob

file_text = open("moby-dick-text.txt", "r", encoding='utf-8') # Open Moby Dick from Project Gutenberg
# Notice the utf-8. This is called an 'encoding' which tells the computer what characters of text
# will be included. If you work with non-English texts you will use different encodings for text.

moby_dick_text = file_text.read()

# Now that we have the string (moby_dick_text) we are done with the file
file_text.close() # So close it

# If you want to see what moby_dick_text looks like, try printing it out

# Now we have the entire book as a string.

# Let's try using the textblob module. For more information see the QuickStart guide
# https://textblob.readthedocs.io/en/dev/quickstart.html

# First, we will turn it into a textblob using the imported TextBlob class
blob = TextBlob(moby_dick_text)

# We can use the words count function
whale_count = blob.words.count("whale")

print("The number of times 'whale' is mentioned in Moby Dick is:", whale_count)


# We can count the number of times the word 'man' is in here too.
man_count = blob.words.count("man")

print("The number of times 'man' is mentioned in Moby Dick is:", man_count)


# What about plural versions of words such as men?
# We can make all words singular and then count the word man
man_sing_count = blob.words.singularize().count("man")

print("The number of times 'man'/'men' are mentioned in Moby Dick is:", man_sing_count)

# We can do the same thing to whale.
whale_sing_count = blob.words.singularize().count("whale")

print("The number of times 'whale'/'whales' are mentioned in Moby Dick is:", whale_sing_count)

# How often do they talk about men versus man+men, same for whales?
men_ratio = (man_sing_count - man_count) / man_sing_count
whales_ratio = (whale_sing_count - whale_count) / whale_sing_count

print("Men ratio", men_ratio)
print("whales ratio", whales_ratio)

print("So it seems that it is less man versus whale and more men versus whale...")
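
For comparison, the same kind of counting can be sketched with only the standard library, using `collections.Counter` (a dictionary that counts things). This is an illustration on a made-up sentence, not a replacement for TextBlob: it lowercases everything, uses a simple regular expression to find words, and has no notion of singular and plural, so the plural form is looked up by hand.

```python
from collections import Counter
import re

# A made-up sentence standing in for the full text of the book
sample = "The whale! The whales swam; a man saw the whale, and the men cheered."

# Lowercase the text and pull out runs of letters and apostrophes as words
words = re.findall(r"[a-z']+", sample.lower())

# Counter maps each word to the number of times it appears
word_counts = Counter(words)

sample_whale_count = word_counts["whale"]
sample_whale_total = word_counts["whale"] + word_counts["whales"]

# Share of mentions that use the plural form
plural_share = (sample_whale_total - sample_whale_count) / sample_whale_total

print("whale:", sample_whale_count)
print("whale + whales:", sample_whale_total)
print("plural share:", plural_share)
print("Three most common words:", word_counts.most_common(3))
```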


In [None]:
# Let's load a prepared dataset called 'tips'

tips = sns.load_dataset("tips")

# This is similar to survey data in having categorical and numerical variables
# The beauty of using these prepared datasets is the vast number of online examples.
# Use the online examples as a guide. Then translate them to your own problem.

tips

In [None]:
#plot = sns.scatterplot(x="tip", y="size", # x=tip, y=table size
#                data=tips)

# Recent versions of seaborn require x and y to be passed as keyword arguments
plot = sns.jointplot(x="tip", y="size", data=tips,
                  kind="scatter", # Scatter, but try different kinds
                  color="r", # Try "r" or "b" or "m"
                  height=8) # Height controls the size of the graph, try 6 or 10 too.

# There are different ways of viewing the data. Look here for the different kinds of views:
# https://seaborn.pydata.org/generated/seaborn.jointplot.html


