# Class 14: Data visualization continued and text manipulation

Plan for today:
- Review and continuation of data visualization using seaborn
- Text manipulation


## Notes on the class Jupyter setup

If you have the *ydata123_2023e* environment set up correctly, you can get the class code using the code below (which presumably you've already done given that you are seeing this notebook).  

In [None]:
import YData

# YData.download.download_class_code(14)   # get class code    
# YData.download.download_class_code(14, True) # get the code with the answers 

YData.download.download_data("dow.csv")


There are also similar functions to download the homework:

In [None]:
# YData.download.download_homework(6)  # downloads the homework 

If you are using Google Colab, you should install polars and the YData packages by uncommenting and running the code below.

In [None]:
# !pip install https://github.com/emeyers/YData_package/tarball/master

If you are using Google Colab, you should also uncomment and run the code below to mount your Google Drive.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
import pandas as pd
import statistics
import numpy as np
from datetime import datetime
import seaborn as sns

import matplotlib.pyplot as plt
%matplotlib inline

## Warm-up exercises: data visualization with seaborn!

As some warm-up/review exercises, let's review using seaborn's `sns.relplot()` and `sns.displot()` functions. In particular, let's review: 

1. Visualizing the relationship between two quantitative variables using `sns.relplot()`. 

2. Visualizing the distribution of a single quantitative variable using `sns.displot()`.

For these warm-up exercises let's visualize information on the Dow Jones Industrial Average which is loaded below.


In [None]:
dow = pd.read_csv("dow.csv", parse_dates = [0])

dow.head(3)

#### Warm-up 1: Visualizing the relationship between two quantitative variables

To start with, use seaborn to create a visualization of the closing DOW index value as a function of the date. Please create a plot with the following properties.

1. The plot should be a line plot (Hint: setting the `kind` argument to `line` could be useful).

2. The plot should have faceting so that there is one subplot for each day of the week. Hint: the `col` (and `col_wrap`) arguments could be useful. 
 
 
Q: Would it have been a good idea to invest in the DOW after the [2008 economic crisis](https://www.youtube.com/watch?v=SPB1z-IprNU&t=608s)? 

In [None]:
# Visualize the relationship between the date and the close DOW index value

sns.set() 

sns.relplot(data = dow, 
            x = "Year", 
            y = "Close",
            kind = "line",
            col = "Day", 
            col_wrap = 3,
            alpha = .5);

plt.xlabel("Date");
plt.ylabel("Closing DOW value");

#### Warm-up 2: Visualizing the distribution of a single quantitative variable

Now use seaborn to create a plot of a single quantitative variable, by visualizing the opening DOW values. 

Experiment with setting the `kind` argument to `hist`, `kde` and `ecdf`. 

Q: Which type of plot is best at showing the proportion of days that the DOW had a value less than 10,000?

In [None]:
# Visualize the distribution of the opening price of the DOW

sns.displot(data = dow, x = "Open", kind = "ecdf");
plt.xlabel("Opening DOW index value");   # the DOW value is on the x-axis; the y-axis shows the proportion
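
To see what an ECDF curve encodes, here is a minimal sketch that computes a single point on it by hand. The opening values below are made up for illustration (they are not taken from `dow.csv`), and the variable names are hypothetical.

```python
# Hypothetical opening values, made up for illustration (not taken from dow.csv)
open_values = [8500.0, 9200.0, 9900.0, 10400.0, 11800.0, 12500.0, 13100.0, 9700.0]

threshold = 10_000

# The height of the ECDF at a threshold is the fraction of observations at or below it
n_below = sum(1 for value in open_values if value <= threshold)
proportion_below = n_below / len(open_values)

print(f"{proportion_below:.3f} of the values are at or below {threshold}")
```

Reading the height of the ECDF curve at x = 10,000 gives this same proportion directly, which is much harder to read off a histogram or KDE plot.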

## Seaborn continued

Let's continue to explore [seaborn](https://seaborn.pydata.org/index.html) using our penguin data which is loaded below...


In [None]:
penguins = sns.load_dataset("penguins")

print(type(penguins))

penguins.head()

### Review: plotting relationships between two quantitative variables

Let's review our `relplot()` function using data on penguins by looking at mapping other features of our data onto visual properties including: 
- `x`, and `y` column names to be plotted (as we have done before)
- `hue`: The column name to be mapped to the color of the points
- `size`: The column name to be mapped to the size of points
- `style`: The column name to be mapped to the style of the markers
- `col`: The column name to be mapped to faceting to compare multiple subplots

In [None]:
# plotting bill size on x, and y axes and other properties
sns.relplot(data = penguins, 
            x = "bill_length_mm", 
            y = "bill_depth_mm",
            hue = "species",
            size = "body_mass_g",
            style = "island",
            col = "sex");

### Review: plotting a single quantitative variable

Recall we can plot a single quantitative variable using the `sns.displot()` function.

Properties we can set include:

- `x`: The name of the data column you want to plot
- `hue`: The name of the column that colors each point
- `kind`: The type of plot

Different options for `kind` are: “hist”, “kde”, “ecdf”


In [None]:
# plot the flipper length
g = sns.displot(data = penguins, 
            x="flipper_length_mm", 
            #hue="species", 
            kind="hist");

g.set_xlabels("Flipper length (mm)");

### Seaborn continued: Plotting a quantitative variable for different categorical variable levels

We can plot a quantitative variable for different categorical variable levels using the `sns.catplot()` function.

We specify: 
- `x`: Categorical x-value column name
- `y`: Quantitative y-value column name
- `kind`: The type of plot

The `kind` argument can be set to the following: “strip”, “swarm”, “box”, “violin”, “boxen”, “point”, “bar”, or “count”


In [None]:
# plot flipper length for the different species using different kinds of plots
sns.catplot(data = penguins, 
            x = "species", 
            y = "flipper_length_mm", 
            kind = "strip");

# also try "swarm", "box", "violin", "boxen", "point", or "bar"

<img src = "https://i.imgflip.com/1ezfdq.jpg">

## Text manipulation

A large part of Data Scientists' time is spent cleaning data, and a large part of data cleaning consists of manipulating text.

Let's explore some of the functions that are built into Python for manipulating strings of text. 


### 1. Changing capitalization

One of the most basic things we can do is to change the capitalization of a piece of text. 

One case where this comes up is when one is merging two DataFrames that have the same key values but the values have different capitalization. For example, one might have two DataFrames that have a column that has the names of different countries, but in one DataFrame the country names are capitalized and in the other they are not. 

Python strings have a number of methods to change the capitalization of words including: 

- `capitalize()`: Converts the first character to upper case
- `lower()`: Converts a string into lower case
- `upper()`: Converts a string into upper case
- `title()`: Converts the first character of each word to upper case
- `swapcase()`: Swaps cases, lower case becomes upper case and vice versa

Let's explore these methods by manipulating this [quote](https://www.brainyquote.com/topics/yale-quotes) from [Herman Melville](https://en.wikipedia.org/wiki/Herman_Melville): "a whale ship was my Yale College and my Harvard". 


In [None]:
melville_quote = "a whale ship was my Yale College and my Harvard"

melville_quote


In [None]:
# Capitalize the first letter 

melville_quote.capitalize()

In [None]:
# convert all letters to lower case

melville_quote.lower()

In [None]:
# convert all letters to upper case

melville_quote.upper()

In [None]:
# make the first letter of each word capitalized

melville_quote.title()

In [None]:
# Make uppercase lowercase, and lowercase uppercase

melville_quote.swapcase()
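
To connect this back to the merging problem described above, here is a minimal sketch of how lower-casing makes keys comparable. It uses plain Python lists rather than DataFrames to stay self-contained, and the country names are hypothetical.

```python
# Hypothetical country names as they might appear in two different tables
names_a = ["France", "JAPAN", "brazil"]
names_b = ["france", "Japan", "Brazil", "Kenya"]

# Exact comparison finds no shared keys because the capitalization differs
exact_matches = set(names_a) & set(names_b)

# Lower-casing both sides first makes the comparison case-insensitive
normalized_matches = {name.lower() for name in names_a} & {name.lower() for name in names_b}

print(exact_matches)               # set()
print(sorted(normalized_matches))  # ['brazil', 'france', 'japan']
```

The same idea applies to a DataFrame key column: normalize the capitalization on both sides before joining.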


### 2. String padding

Often we want to remove extra spaces (called "white space") from the front or end of a string. Or conversely, sometimes we want to add extra spaces to make a set of strings the same length (this is known as "string padding"). 

Python strings have a number of methods that can pad/trim strings including: 

- `strip()`: Returns a trimmed version of the string (i.e., with no leading or trailing white space). 
- `rstrip()`: Returns a right trim version of the string
- `lstrip()`: Returns a left trim version of the string

- `center(num)`: Returns a centered string (with equal padding on both sides)
- `ljust(num)`: Returns a left justified version of the string
- `rjust(num)`: Returns a right justified version of the string

- `zfill(num)`: Pads the string with 0's at the beginning until it is `num` characters long

Let's use a modified version of the Melville quote to explore this.


In [None]:
melville_quote2 = "    a whale ship was my Yale College and my Harvard   "
melville_quote2

In [None]:
# strip the white space
melville_quote2.strip()

In [None]:
# strip just the white space on the left
melville_quote2.lstrip()

In [None]:
# center the quote by padding with white space 
# to have a total of 70 characters
melville_quote.center(70)


In [None]:
# make a number have leading 0's 
# (why is this useful?)

"7".zfill(3)

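One possible answer to the question above is sorting: strings are compared character by character, so unpadded numbers stored as text sort in a surprising order. The sketch below uses made-up ID strings to show the effect.

```python
# Hypothetical IDs stored as text
file_ids = ["10", "2", "1", "33"]

# Sorting the raw strings compares them character by character, so "10" comes before "2"
print(sorted(file_ids))      # ['1', '10', '2', '33']

# Padding every ID to the same width makes alphabetical order match numeric order
padded_ids = [file_id.zfill(3) for file_id in file_ids]
print(sorted(padded_ids))    # ['001', '002', '010', '033']
```

Zero padding is also common in file names (for example `frame_007.png`) for the same reason.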

### 3. Checking string properties

There are also many functions to check properties of strings including:

- `isalnum()`: Returns True if all characters in the string are alphanumeric
- `isalpha()`: Returns True if all characters in the string are in the alphabet
- `isnumeric()`: Returns True if all characters in the string are numeric

- `isspace()`: Returns True if all characters in the string are whitespaces

- `islower()`: Returns True if all characters in the string are lower case
- `isupper()`: Returns True if all characters in the string are upper case
- `istitle()`: Returns True if the string follows the rules of a title

Let's test some of these methods out...


In [None]:
# checking if a string is all letters
# (print both results; a notebook cell only displays its last expression)

print("abc".isalpha())

print("abc123".isalpha())


In [None]:
# checking if a string is all numbers

"123".isnumeric()

In [None]:
# checking if a string only contains white space

print("   ".isspace())

print("\n".isspace())   # also works for new line characters \n, and tabs \t

In [None]:
# checking if a string is upper case

"I AM NOT YELLING!!!".isupper()

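These checks are handy for filtering messy data, but they are stricter than one might expect. The sketch below, using made-up values, shows that `isnumeric()` only accepts strings made up entirely of numeric characters, so decimal points, minus signs, and empty strings all fail the check.

```python
# Hypothetical raw values as they might appear in a messy text column
raw_values = ["42", "007", "3.14", "-5", "", "twelve", "1e3"]

# Keep only the strings where every character is numeric
numeric_only = [value for value in raw_values if value.isnumeric()]

print(numeric_only)   # ['42', '007']
```

If negative or decimal numbers need to be recognized, a different approach is required, such as trying `float(value)` inside a `try`/`except` block.
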
### 4. Splitting and joining strings

There are several methods that can help us join strings that are contained in a list into a single string, or conversely, parse a single string into a list of strings. These include: 

- `split(separator_string)`: Splits the string at the specified separator, and returns a list
- `splitlines()`: Splits the string at line breaks and returns a list

- `join(a_list)`: Joins the elements of an iterable into a single string, using the string it is called on as the separator

In [None]:
# split the Melville quote at each space into a list

melville_quote.split(" ")

In [None]:
# split a string at each line into a list

poem = """Some say the world will end in fire,
Some say in ice.
From what I’ve tasted of desire
I hold with those who favor fire.
But if it had to perish twice,
I think I know enough of hate
To say that for destruction ice
Is also great
And would suffice."""

poem.splitlines()

In [None]:
# join a string together

a_list = ["A", "Whale", "of", "a", "Tale"]

" ".join(a_list)

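One detail worth knowing is that `split(" ")` and `split()` with no argument behave differently when there are repeated spaces. The sketch below uses a made-up string with extra spaces to show the difference, and then combines `split()` and `join()` to clean it up.

```python
messy = "a  whale   ship"   # two spaces, then three spaces

# Splitting on a single space produces empty strings between consecutive spaces
print(messy.split(" "))   # ['a', '', 'whale', '', '', 'ship']

# Splitting with no argument treats any run of white space as one separator
print(messy.split())      # ['a', 'whale', 'ship']

# Splitting and re-joining collapses the repeated spaces
cleaned = " ".join(messy.split())
print(cleaned)            # a whale ship
```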


### 5. Finding and replacing substrings

Some methods for locating a substring within a larger string include: 

- `count(substring)`: Returns the number of times a specified value occurs in a string
- `find(substring)`: Searches the string for a specified value and returns the first position of where it was found. (also see `index()`)
- `rfind(substring)`: Searches the string for a specified value and returns the last position of where it was found. (also see `rindex()`)

- `startswith(substring)`: Returns True if the string starts with the specified value
- `endswith(substring)`: Returns True if the string ends with the specified value

- `replace(original_str, replacement_str)`: Replace a substring with a different string. 

In [None]:
# How many times does the word "my" occur in the Melville quote? 
melville_quote.count("my")

In [None]:
# at what index does the first instance of "my" occur?
melville_quote.index("my")

In [None]:
# does the quote start with "a"?
melville_quote.startswith("a")

In [None]:
# does the quote end with Harvard? 

melville_quote.endswith("Harvard")

In [None]:
# replace a substring
melville_quote.replace("Harvard", "that other school that is almost as good")
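
The `find()` and `index()` methods return the same position when the substring is present, but they differ when it is missing. The sketch below (which redefines the quote under the name `quote` so it runs on its own) shows the difference.

```python
quote = "a whale ship was my Yale College and my Harvard"

print(quote.find("my"))    # 17, position of the first "my"
print(quote.rfind("my"))   # 37, position of the last "my"

# find() returns -1 when the substring is missing
missing_position = quote.find("Princeton")
print(missing_position)    # -1

# index() raises a ValueError instead
try:
    quote.index("Princeton")
    index_raised = False
except ValueError:
    index_raised = True

print(index_raised)        # True
```

Checking for `-1` is easy to forget, so `index()` can be the safer choice when a missing substring should be treated as an error.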

### 6. Filling in strings with particular values

There are a number of ways to fill in parts of a string with particular values. Perhaps the most useful is to use "f strings", which have the following syntax: 

`f"my string {value_to_fill} will be filled in"`.

Where the value of the variable `value_to_fill` will be filled into the string. 

Let's try it out... 


In [None]:
# Let's use an f-string

person = "Herman Melville"

f"Mr. {person} liked writing about whales."



In [None]:
# We can also do formatting with f-strings

amount = 123
f"${amount:.2f} is a lot of money!"
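
The part after the colon inside the braces is a format specification. Here is a short sketch of a few other specifications that come up often when reporting results; the variable names and values are made up for illustration.

```python
population = 1234567
share = 0.4567
label = "Yale"

with_commas = f"{population:,}"      # thousands separator
as_percent = f"{share:.1%}"          # multiply by 100, keep one decimal, add %
right_aligned = f"[{label:>8}]"      # pad on the left to a width of 8
left_aligned = f"[{label:<8}]"       # pad on the right to a width of 8

print(with_commas)     # 1,234,567
print(as_percent)      # 45.7%
print(right_aligned)   # [    Yale]
print(left_aligned)    # [Yale    ]
```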

### Example: string processing on webpages

As an example, let's do some string processing on webpages!


In [None]:
# Download a webpage and save it as a file called politics.html

import requests

url = 'https://www.foxnews.com/politics/white-house-doctor-says-biden-fit-serve-president'
r = requests.get(url, allow_redirects=True)

# the with statement closes the file automatically once the write finishes
with open('politics.html', 'wb') as f:
    f.write(r.content)



In [None]:
# read in the file as a string called webpage_string
with open('politics.html', 'r', encoding="utf8") as file:
    webpage_string = file.read()

# look at the first 300 characters 
webpage_string[0:300]

In [None]:
# Replace a word on the webpage

webpage_updated = webpage_string.replace("Biden", "Sleepy Joe")


In [None]:
# write updated string to a file
text_file = open("updated_politics.html", "w", encoding="utf8")
n = text_file.write(webpage_updated)
text_file.close()

<img src = "https://i1.sndcdn.com/avatars-000316245474-0yp1vu-t500x500.jpg">

## Regular expressions

Regular expressions are strings with special characters that allow you to find more complex patterns in pieces of text.

To use regular expressions in Python we can use the `re` module. 

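If we convert the output of the `re.match()` function to a Boolean (i.e., `bool(re.match())`), we can tell if a piece of text starts with a particular pattern. Note that `re.match()` only looks for the pattern at the beginning of the text; the `re.search()` function looks for the pattern anywhere in the text. 

Let's run tests to check if:

1. Our Melville quote starts with the letter "a"
2. Our Melville quote starts with the letter "z"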


In [None]:
import re

# check if our Melville quote starts with the letter a
print(bool(re.match("a", melville_quote)))


In [None]:
# check if our Melville quote starts with the letter z
print(bool(re.match("z", melville_quote)))
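
Because `re.match()` only looks for the pattern at the start of the string, it cannot tell us whether a pattern occurs somewhere in the middle. The sketch below (which redefines the quote under the name `quote` so it runs on its own) contrasts it with `re.search()`, which scans the whole string.

```python
import re

quote = "a whale ship was my Yale College and my Harvard"

# "whale" is in the quote, but not at the very beginning
match_result = bool(re.match("whale", quote))
search_result = bool(re.search("whale", quote))

print(match_result)    # False
print(search_result)   # True

# neither function finds a letter that is not in the quote at all
print(bool(re.search("z", quote)))   # False
```

For a plain substring with no special characters, the expression `"whale" in quote` gives the same answer as `re.search()` and is simpler to read.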


A few special characters that can be used in regular expressions are:
- `^` means the start of the string (but inside square brackets, as in `[^abc]`, it means "not one of these characters")
- `$` means the end of the string
- `[Pp]` means P or p

In [None]:
# check if our Melville quote starts with an upper or lower case A
print(bool(re.match("[aA]", melville_quote)))


In [None]:
# check if our Melville quote starts with a vowel
print(bool(re.match("^[aeiouAEIOU]", melville_quote)))

In [None]:
# check if our Melville quote does not start with a vowel
print(bool(re.match("^[^aeiouAEIOU]", melville_quote)))

In [None]:
# we can use the period . to match any one character

bool(re.match("m.ss", "miss"))   # miss, mass, mess


In [None]:
# * means repeat the previous character 0 or more times
bool(re.match("xy*z", "xz"))   # xz, xyz, xyyz, xyyyz, ...

In [None]:
# + means repeat the previous character 1 or more times
bool(re.match("xy+z", "xz"))   # xyz, xyyz, xyyyz, ...

In [None]:
# will the following match?

bool(re.match(".*a.*e",  "pineapple"))  


#### Example: matching phone numbers

In [None]:
phone_strings = [ "apple", 
                 "219 733 8965", 
                 "329-293-8753", 
                 "Work: 579-499-7527",
                 "Home: 543.355.3679"]

phone_strings

In [None]:
phone_expression = ".*([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

In [None]:
for phone_string in phone_strings:
    print(bool(re.match(phone_expression, phone_string)))
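
The parentheses in the phone expression do more than group characters: they also capture the matched pieces so that they can be pulled out afterwards. The sketch below reuses the same expression and extracts the three parts of one of the example numbers using the match object's `groups()` method.

```python
import re

phone_expression = ".*([2-9][0-9]{2})[- .]([0-9]{3})[- .]([0-9]{4})"

# A successful match returns a match object that holds the captured groups
match = re.match(phone_expression, "Work: 579-499-7527")
parts = match.groups()
print(parts)              # ('579', '499', '7527')

# Re-join the pieces using a single consistent separator
print("-".join(parts))    # 579-499-7527

# A failed match returns None, so check before calling .groups()
no_match = re.match(phone_expression, "apple")
print(no_match)           # None
```

This is one way to standardize phone numbers that were entered with different separators.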


#### Escape characters

In [None]:
# Does not match because $ means the end of the string, and nothing can come after the end of a string
bool(re.match(".*$100", "Joanna has $100 and Chris has $0"))

In [None]:
# using escape characters can help
bool(re.match(".*\\$100", "Joanna has $100 and Chris has $0"))
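
Two standard tools can make escaping less error-prone, sketched below. A raw string (written with an `r` prefix) lets us type a single backslash instead of two, and `re.escape()` adds the needed backslashes to a piece of literal text automatically.

```python
import re

text = "Joanna has $100 and Chris has $0"

# In a raw string, a single backslash is enough to escape the dollar sign
raw_result = bool(re.search(r"\$100", text))
print(raw_result)        # True

# re.escape() escapes any special characters in a literal string for us
pattern = re.escape("$100")
print(pattern)           # \$100

escaped_result = bool(re.search(pattern, text))
print(escaped_result)    # True
```

`re.escape()` is particularly useful when the text to search for comes from a variable and might contain special characters.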

#### Special characters

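Other special characters are also designated by using a backslash first (written as a double backslash in a regular Python string, or as a single backslash in a raw string such as `r"\s"`)

`\s`   any white space character (space, tab, or new line)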

`\n`   new line     or also   `\r`

`\t`   tab


In [None]:
bool(re.match(".*\n", melville_quote))

In [None]:
bool(re.match(".*\n", poem))
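
As a brief sketch of how `\s` can be used for data cleaning, the example below splits and cleans a made-up string that mixes spaces, a tab, and a new line. It uses `re.split()` and `re.sub()`, which were not covered above, together with the `+` character to match one or more white space characters in a row.

```python
import re

# Hypothetical string that mixes a tab, a new line, and a double space
messy = "a whale\tship\nwas  my Yale"

# Split on any run of white space characters
words = re.split(r"\s+", messy)
print(words)      # ['a', 'whale', 'ship', 'was', 'my', 'Yale']

# Replace every run of white space with a single space
cleaned = re.sub(r"\s+", " ", messy)
print(cleaned)    # a whale ship was my Yale
```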