# Class 13: Data visualization continued and text manipulation

Plan for today:
- Review data visualization using matplotlib
- Data visualization using seaborn
- If there is time: text manipulation


## Notes on the class Jupyter setup

If you have the *ydata123_2023e* environment set up correctly, you can get the class code using the code below (which presumably you've already done given that you are seeing this notebook).  

In [None]:
import YData

# YData.download.download_class_code(13)   # get class code    
# YData.download.download_class_code(13, True) # get the code with the answers 

YData.download.download_data("dow.csv")


There are also similar functions to download the homework:

In [None]:
# YData.download.download_homework(6)  # downloads the homework 

If you are using Google Colab, you should install polars and the YData packages by uncommenting and running the code below.

In [None]:
# !pip install https://github.com/emeyers/YData_package/tarball/master

If you are using Google Colab, you should also uncomment and run the code below to mount your Google Drive.

In [None]:
# from google.colab import drive
# drive.mount('/content/drive')

In [None]:
import pandas as pd
import statistics
import numpy as np
from datetime import datetime

import matplotlib.pyplot as plt
%matplotlib inline

## Warm-up exercises: data visualization with matplotlib

As some warm-up/review exercises, let's use matplotlib to create the following types of plots: 

1. Visualizing the relationship between two quantitative variables
2. Visualizing the distribution of a single quantitative variable in 2 different ways.
3. Visualizing categorical data in two different ways.

To do this, we will visualize information on returns of the Dow Jones Industrial Average, which is loaded below.


In [None]:
dow = pd.read_csv("dow.csv", parse_dates = [0])

dow.head(3)

#### Warm-up 1: Visualizing the relationship between two quantitative variables

To start with, create a visualization of the closing Dow index value as a function of the opening Dow value. 

As always, it is best to first think about what you want to do, and then think about how to do it. For data visualization, this means thinking about what *type of plot you want to create* (e.g., sketching it on paper first can be a good idea), and then thinking about how to write the matplotlib code to create the plot. 
 

In [None]:
# Visualize the relationship between opening and closing Dow index values





#### Warm-up 2: Visualizing the distribution of a single quantitative variable

Now, to practice creating a plot of a single quantitative variable, please create a visualization of just the opening Dow value. 

Please do this with two subplots (1 row, 2 columns), where each subplot visualizes the opening value in a different way. Again, first thinking about what the plots should look like before creating them is a good idea. 
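
As a rough illustration of the 1-row, 2-column layout (not the answer for the Dow data), here is a minimal sketch on synthetic data. The variable names and the choice of a histogram plus a box plot are assumptions made for this example only.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data (NOT the Dow data): 500 draws from a normal distribution
rng = np.random.default_rng(seed=0)
values = rng.normal(loc=100, scale=15, size=500)

# One row, two columns of subplots
fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: histogram of the values
ax_hist.hist(values, bins=30, edgecolor="black")
ax_hist.set_xlabel("Value")
ax_hist.set_ylabel("Frequency")
ax_hist.set_title("Histogram")

# Right panel: box plot of the same values
ax_box.boxplot(values)
ax_box.set_ylabel("Value")
ax_box.set_title("Box plot")

fig.tight_layout()
```

`plt.subplots(1, 2)` returns the figure and one axes object per panel, so each panel can be drawn and labeled independently.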

In [None]:
# Visualize the distribution of the opening price in two different ways











#### Warm-up 3: Visualizing categorical data in two different ways

Now let's visualize categorical data. The code below creates a DataFrame that, for each day of the week, counts how many times the market was open for trading stocks.  

Please use this data to create two subplots (1 row, 2 columns) that visualize, in two different ways, how many times the Dow Jones market was open on each day of the week. 


In [None]:
day_count = (dow
              .groupby("Day")
              .agg(count = ("Date", "count"))
              .reset_index()
             )

display(day_count)


In [None]:
# Visualize how many times the market was open on each day of the week
# Please do this in two different ways using subplots 











## Seaborn!

[Seaborn](https://seaborn.pydata.org/index.html) is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. 

I.e., it is built on top of matplotlib but produces better looking plots that are easier to create. 

Let's start by examining different themes which can produce better looking plots. We can do this using the `sns.set_theme()` function. 


In [None]:
# Import seaborn
import seaborn as sns

# Apply the default theme
sns.set_theme()   # default style is 'darkgrid'
#sns.set_theme(style='whitegrid')

# Side note: Matplotlib also has themes
# plt.style.available
# plt.style.use('fivethirtyeight')


# Re-create a line plot of wheat prices over time here








### Plotting relationships between two quantitative variables

We can plot relationships between two quantitative variables using the `sns.relplot()` function.


In [None]:
# plot relationship between gas and egg prices




#### Penguins!

Let's continue to explore the relplot using data on penguins. 

We will also look at mapping other features of our data onto visual properties including: 
- `x`, and `y` column names to be plotted (as we have done before)
- `hue`: The column name to be mapped to the color of the points
- `size`: The column name to be mapped to the size of points
- `style`: The column name to be mapped to the style of the markers
- `col`: The column name to be mapped to faceting to compare multiple subplots
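
To make these mappings concrete, here is a small sketch on a made-up DataFrame rather than the penguins data. All column names (`height`, `weight`, `group`, `age`, `site`) are hypothetical and exist only for this example.

```python
import pandas as pd
import seaborn as sns

# Tiny made-up dataset (hypothetical columns, not the penguins data)
toy = pd.DataFrame({
    "height": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "weight": [2.1, 3.9, 6.2, 7.8, 10.1, 12.2],
    "group": ["a", "a", "a", "b", "b", "b"],
    "age": [1, 2, 3, 1, 2, 3],
    "site": ["north", "south", "north", "south", "north", "south"],
})

# Map columns onto position, color, marker size, marker style, and facets
g = sns.relplot(
    data=toy,
    x="height",
    y="weight",
    hue="group",    # color encodes group
    size="age",     # marker size encodes age
    style="group",  # marker shape also encodes group
    col="site",     # one subplot per site
)
```

Because `site` has two levels, the result is a grid with one subplot per level, which makes side-by-side comparison easy.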


In [None]:
# Let's look at some penguins
penguins = sns.load_dataset("penguins")

print(type(penguins))

penguins.head()


In [None]:
# plotting bill size on x, and y axes and other properties









### Plotting a single quantitative variable

We can plot a single quantitative variable using the `sns.displot()` function.

Properties we can set include:
- `x`: The name of the data column you want to plot
- `hue`: The name of the column used to color the data by group
- `kind`: The type of plot

Different options for `kind` are: “hist”, “kde”, “ecdf”


In [None]:
# plot the flipper length








### Plotting a quantitative variable for different categorical variable levels

We can plot a quantitative variable for different categorical variable levels using the `sns.catplot()` function.

We specify: 
- `x`: Categorical x-value column name
- `y`: Quantitative y-value column name
- `kind`: The type of plot

The `kind` argument can be set to the following: “strip”, “swarm”, “box”, “violin”, “boxen”, “point”, “bar”, or “count”
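
A minimal sketch of `sns.catplot()` on made-up data is shown below. The column names `group` and `score` are hypothetical, and `kind="box"` is just one of the options listed above.

```python
import pandas as pd
import seaborn as sns

# Made-up data: a quantitative score for each of two categorical groups
scores = pd.DataFrame({
    "group": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "score": [3.1, 4.0, 3.6, 5.2, 6.8, 7.4, 6.1, 8.0],
})

# One box per category level; change kind to "strip", "violin", etc. to compare
box_grid = sns.catplot(data=scores, x="group", y="score", kind="box")
```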


In [None]:
# plot flipper length for the different species using different kinds of plots







# also try “strip”, “swarm”, “box”, “violin”, “boxen”, “point”, or “bar”

<img src = "https://i.imgflip.com/1ezfdq.jpg">

## Text manipulation

A large part of Data Scientists' time is spent cleaning data, and a large part of data cleaning consists of manipulating text.

Let's explore some of the functions that are built into Python for manipulating strings of text. 


### 1. Changing capitalization

One of the most basic things we can do is to change the capitalization of a piece of text. 

One case where this comes up is when one is merging two DataFrames that have the same key values but the values have different capitalization. For example, one might have two DataFrames that have a column that has the names of different countries, but in one DataFrame the country names are capitalized and in the other they are not. 

Python strings have a number of methods to change the capitalization of words including: 

- `capitalize()`: Converts the first character to upper case
- `lower()`: Converts a string into lower case
- `upper()`: Converts a string into upper case
- `title()`: Converts the first character of each word to upper case
- `swapcase()`: Swaps cases, lower case becomes upper case and vice versa
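
As a small illustration of the merging scenario described above, the sketch below normalizes two made-up lists of country names to lower case so that they can be compared. The lists are invented for this example.

```python
# Two made-up lists holding the same countries with inconsistent capitalization
countries_a = ["France", "JAPAN", "brazil"]
countries_b = ["france", "Japan", "Brazil"]

# Lower-casing both sides makes the keys comparable
normalized_a = [name.lower() for name in countries_a]
normalized_b = [name.lower() for name in countries_b]
print(normalized_a == normalized_b)   # True

# The other capitalization methods, applied to a short string
print("new haven".capitalize())       # New haven
print("new haven".upper())            # NEW HAVEN
print("new haven".title())            # New Haven
print("New Haven".swapcase())         # nEW hAVEN
```

String methods return a new string and leave the original unchanged, so the result needs to be assigned to a variable if you want to keep it.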

Let's explore these methods by manipulating this [quote](https://www.brainyquote.com/topics/yale-quotes) from [Herman Melville](https://en.wikipedia.org/wiki/Herman_Melville): "a whale ship was my Yale College and my Harvard". 


In [None]:
melville_quote = "a whale ship was my Yale College and my Harvard"

melville_quote


In [None]:
# Capitalize the first letter 



In [None]:
# convert all letters to lower case



In [None]:
# convert all letters to upper case



In [None]:
# make the first letter of each word capitalized



In [None]:
# Make uppercase lowercase, and lowercase uppercase




### 2. String padding

Often we want to remove extra spaces (called "white space") from the front or end of a string. Or conversely, sometimes we want to add extra spaces to make a set of strings the same length (this is known as "string padding"). 

Python strings have a number of methods that can pad/trim strings including: 

- `strip()`: Returns a trimmed version of the string (i.e., with no leading or trailing white space). 
- `rstrip()`: Returns a right trim version of the string
- `lstrip()`: Returns a left trim version of the string

- `center(num)`: Returns a centered string (with equal padding on both sides)
- `ljust(num)`: Returns a left justified version of the string
- `rjust(num)`: Returns a right justified version of the string

- `zfill(num)`: Pads the string with 0s at the beginning until it reaches a total length of `num`
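
Here is a short sketch of these methods on made-up strings. The strings and the `"*"` fill character passed to `center()` are choices made for this example only.

```python
raw = "  New Haven \n"

stripped = raw.strip()            # 'New Haven'
left_trimmed = raw.lstrip()       # 'New Haven \n'
right_trimmed = raw.rstrip()      # '  New Haven'

# Padding methods take the desired total length of the result
centered = "CT".center(6, "*")    # '**CT**'
left = "CT".ljust(5)              # 'CT   '
right = "CT".rjust(5)             # '   CT'

# zfill pads with leading zeros up to the given total length
zip_code = "6511".zfill(5)        # '06511'

print(repr(stripped), repr(centered), repr(zip_code))
```

Note that the number passed to each padding method is the total length of the result, not the number of characters added.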

Let's use a modified version of the Melville quote to explore this.


In [None]:
melville_quote2 = "    a whale ship was my Yale College and my Harvard   "
melville_quote2

In [None]:
# strip the white space



In [None]:
# strip just the left white space



In [None]:
# center the quote by padding with white space 
# to have a total of 70 characters



In [None]:
# make a number have leading 0's 
# (why is this useful)




### 3. Checking string properties

There are also many functions to check properties of strings including:

- `isalnum()`: Returns True if all characters in the string are alphanumeric
- `isalpha()`: Returns True if all characters in the string are in the alphabet
- `isnumeric()`: Returns True if all characters in the string are numeric

- `isspace()`: Returns True if all characters in the string are whitespaces

- `islower()`: Returns True if all characters in the string are lower case
- `isupper()`: Returns True if all characters in the string are upper case
- `istitle()`: Returns True if the string follows the rules of a title
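
A few quick checks on made-up strings show how these methods behave, including some cases that return False.

```python
# Each check returns a Boolean; the strings are made up for illustration
checks = {
    "alpha_only": "Yale".isalpha(),          # True: letters only
    "alpha_digits": "Yale2023".isalpha(),    # False: contains digits
    "alnum": "Yale2023".isalnum(),           # True: letters and digits
    "numeric": "2023".isnumeric(),           # True
    "decimal_point": "20.23".isnumeric(),    # False: the period is not numeric
    "whitespace": " \t\n".isspace(),         # True: space, tab, newline
    "upper": "YALE".isupper(),               # True
    "title": "Moby Dick".istitle(),          # True
}

for name, result in checks.items():
    print(name, result)
```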

Let's test some of these methods out...


In [None]:
# checking if a string is all letters






In [None]:
# checking if a string is all numbers



In [None]:
# checking if a string only contains spaces




# also works for new line characters \n, and tabs \t




In [None]:
# checking if a string is upper case



### 4. Splitting and joining strings

There are several methods that can help us join strings that are contained in a list into a single string, or conversely, parse a single string into a list of strings. These include: 

- `split(separator_string)`: Splits the string at the specified separator, and returns a list
- `splitlines()`: Splits the string at line breaks and returns a list

- `join(a_list)`: Converts the elements of an iterable into a string
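
The sketch below shows all three methods on made-up strings. One detail worth noticing is that `join()` is called on the separator string, with the list passed as the argument.

```python
sentence = "call me Ishmael"

# split: string -> list of strings
words = sentence.split(" ")
print(words)                 # ['call', 'me', 'Ishmael']

# splitlines: one list element per line
lines = "line one\nline two".splitlines()
print(lines)                 # ['line one', 'line two']

# join: list of strings -> single string, glued together with the separator
hyphenated = "-".join(words)
print(hyphenated)            # call-me-Ishmael
```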

In [None]:
# split the Melville quote at each space into a list



In [None]:
# split a string at each line into a list

poem = """Some say the world will end in fire,
Some say in ice.
From what I’ve tasted of desire
I hold with those who favor fire.
But if it had to perish twice,
I think I know enough of hate
To say that for destruction ice
Is also great
And would suffice."""



In [None]:
# join a string together

a_list = ["A", "Whale", "of", "a", "Tale"]





### 5. Finding and replacing substrings

Some methods for locating a substring within a larger string include: 

- `count(substring)`: Returns the number of times a specified value occurs in a string
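- `find(substring)`: Searches the string for a specified value and returns the first position where it was found, or -1 if it was not found. (also see `index()`)
- `rfind(substring)`: Searches the string for a specified value and returns the last position where it was found. (also see `rindex()`)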

- `startswith(substring)`: Returns True if the string starts with the specified value
- `endswith(substring)`: Returns True if the string ends with the specified value

- `replace(original_str, replacement_str)`: Replace a substring with a different string. 
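
Here is a brief sketch of these methods on a made-up string. It also uses `find()`, the left-to-right counterpart of `rfind()`.

```python
text = "the sea, the ship, the whale"

n_the = text.count("the")           # 3 occurrences
first_the = text.find("the")        # 0: index of the first occurrence
last_the = text.rfind("the")        # 19: index of the last occurrence
missing = text.find("shark")        # -1: substring not present

starts = text.startswith("the")     # True
ends = text.endswith("whale")       # True

# replace returns a new string; the original is unchanged
replaced = text.replace("the", "a")
print(replaced)                     # a sea, a ship, a whale
```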

In [None]:
# How many times does the word "my" occur in the Melville quote? 



In [None]:
# at what index does the first instance of "my" occur?



In [None]:
# does the quote start with "a"?



In [None]:
# does the quote end with Harvard? 



In [None]:
# replace a substring




### 6. Filling in strings with particular values

There are a number of ways to fill in parts of a string with particular values. Perhaps the most useful is to use "f-strings", which have the following syntax: 

`f"my string {value_to_fill} will be filled in"`.

Where the value of the variable `value_to_fill` will be filled into the string. 
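
As a minimal sketch, the example below fills made-up values into f-strings and shows a few common format specifiers, which go after a colon inside the braces. The variable names are invented for this example.

```python
name = "Ishmael"
n_whales = 3
price = 1234.5678

# Basic substitution of variable values
sentence = f"{name} saw {n_whales} whales"
print(sentence)                      # Ishmael saw 3 whales

# Format specifiers follow a colon inside the braces
two_decimals = f"{price:.2f}"        # '1234.57'  (two decimal places)
with_commas = f"{price:,.1f}"        # '1,234.6'  (thousands separator)
padded = f"{n_whales:03d}"           # '003'      (zero-padded to width 3)
percent = f"{0.256:.1%}"             # '25.6%'    (shown as a percentage)

print(two_decimals, with_commas, padded, percent)
```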

Let's try it out... 


In [None]:
# Let's use an f-string





In [None]:
# We can also do formatting with f-strings





### Example: string processing on webpages

As an example, let's do some string processing on webpages!


In [None]:
# Download a webpage and save it as a file called politics.html

import requests

url = 'https://www.foxnews.com/politics/white-house-doctor-says-biden-fit-serve-president'
r = requests.get(url, allow_redirects=True)
# Use a with-block so the file is closed once the content is written
with open('politics.html', 'wb') as f:
    f.write(r.content)



In [None]:
# read in the file as a string called webpage_string
with open('politics.html', 'r', encoding="utf8") as file:
    webpage_string = file.read()

# look at the first 300 characters 



In [None]:
# Replace a word on the webpage




In [None]:
# write updated string to a file
text_file = open("updated_politics.html", "w", encoding="utf8")
n = text_file.write(webpage_updated)
text_file.close()

<img src = "https://i1.sndcdn.com/avatars-000316245474-0yp1vu-t500x500.jpg">