# QF 625 Introduction to Programming
## Lesson 2 | An Introduction to `matplotlib` (featuring `SciPy`) | `RE`view

> Here we go again :) Let's begin our second half of the lesson.

> Data visualization is a critical step in the workflow of analyzing financial data.

> Visualization of your data is important when you share your analysis and findings with your internal and external stakeholders.

> `matplotlib` is the most popular and widely used package in Python for data visualization.


![](matplotlib.png)

In [None]:
import numpy as np

In [None]:
import matplotlib.pyplot as plt
# from matplotlib import pyplot as plt

### Importing matplotlib and pyplot

> `Pyplot` is a collection of functions in the popular visualization package Matplotlib. 

> Its functions manipulate elements of a figure as follows:

- creating a figure
- creating a plotting area
- plotting lines
- plotting observations (data points)
- adding plot labels
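> As a rough sketch, each item in the list above maps to one `pyplot` call. The arrays `days` and `prices` are hypothetical, made-up values used only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data for illustration only: 10 days and made-up prices.
days = np.arange(1, 11)
prices = 100 + 2 * days

fig = plt.figure(figsize=(6, 4))     # create a figure
ax = plt.axes()                      # create a plotting area inside the figure
lines = plt.plot(days, prices)       # plot a line
points = plt.scatter(days, prices)   # plot the observations (data points)
plt.xlabel("Day")                    # add plot labels
plt.ylabel("Price")
plt.title("A Minimal pyplot Figure")
plt.show()
```

> Each call acts on the current figure and plotting area, which is why the calls can simply be listed one after another.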

> Let's use function `plot()` from `pyplot` to create a dashed line graph showing the pattern of a company's stock price.

> You can change the color of the line by adding the argument `color` and the line style by adding the argument `linestyle`.

##### Let's create two arrays: one array contains the first 100 days since IPO; another contains the stock price on each day since IPO.

##### How would you create the two arrays?

In [None]:
x = np.arange(1, 101, 1)
y = np.arange(1, 1001, 10)/3
print(x)
print(y)

In [None]:
plt.plot(x, y)
plt.show()

In [None]:
plt.plot(x, y, color = "red", linestyle = "--")
plt.show()

#### Adding axis labels and titles

> You might want to add labels to your plot so it's clear to other people what information it is trying to convey. 

> Let's add labels to the plot you created in the previous step.

In [None]:
plt.plot(x, y, color = "purple", linestyle = "-")
plt.xlabel("The number of days since IPO")
plt.ylabel("The price of the stock corresponding to days")
plt.title("A Company's Stock Prices During the First 100 days since IPO")
plt.show()

#### Overlaying lines on the same plot

> Often you need to plot multiple groups as different lines.

> To do so, call the function `plot()` multiple times before calling `plt.show()`.

> Let's plot the stocks of two companies over time.

##### Is there a way that we can generate 100 random integers that represent a company's fictitious stock prices?

In [None]:
z = np.random.randint(1, 400, 100)
print(z)

> Random decimal numbers?

In [None]:
np.random.rand(100)
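> `np.random.rand()` returns decimals between 0 and 1, so they need to be rescaled before they look like prices. Here is a minimal sketch, assuming a price range of 1 to 400 (to match the integers above) and a fixed seed; `low`, `high`, and the seed value are assumptions chosen for illustration.

```python
import numpy as np

np.random.seed(0)  # assumption: a fixed seed so the numbers repeat on every run

low, high = 1.0, 400.0  # assumed price range, mirroring randint(1, 400, 100)

# rand() draws from [0, 1); stretch and shift the draws to cover [low, high).
prices = low + (high - low) * np.random.rand(100)

# np.random.uniform does the same rescaling in a single call.
prices_alt = np.random.uniform(low, high, 100)

print(prices[:5].round(2))
print(prices_alt[:5].round(2))
```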

> Let's create a plot with two lines.

In [None]:
plt.plot(x, y, color = "red")
plt.plot(x, z, color = "blue")

plt.title("A Company's Stock Prices During the First 100 days since IPO")
plt.xlabel("The number of days since IPO")
plt.ylabel("The prices of the stocks corresponding to days")
plt.legend(["Company A", "Company B"])
plt.show()

#### Scatter Plots

> The pyplot module can also be used to make other types of plots, like scatter plots. 

> Now, let's create a scatter plot of the companies' stock prices over time.

In [None]:
plt.scatter(x, y, color = "red", label = "Company A")
plt.scatter(x, z, color = "blue", label = "Company B")

plt.title("A Company's Stock Prices During the First 100 days since IPO")
plt.xlabel("The number of days since IPO")
plt.ylabel("The prices of the stocks corresponding to days")
plt.legend()
plt.show()

> If you still wish to use the function `plot()` for scatter plots, there's a way :) Passing the format string `"o"` draws circle markers without connecting lines.

In [None]:
plt.plot(x, y, "o")
plt.plot(x, z, "o")

plt.title("A Company's Stock Prices During the First 100 days since IPO")
plt.xlabel("The number of days since IPO")
plt.ylabel("The prices of the stocks corresponding to days")
plt.legend(["Company A", "Company B"], loc = "upper right")
plt.show()

### The Curious Case of Anscombe's Quartet 
### ***or Why Data Visualization Matters in Statistical Modeling of Financial Data***

> Suppose that you have received four datasets, one from each of four industries.

> Each dataset contains companies within its respective industry.

> For each dataset, you intend to analyze the relationship between companies' growth indicator and their price-to-earnings ratio.


In [None]:
from matplotlib.pyplot import subplot, scatter, plot, axis
from scipy.stats import linregress

> Suppose that you have the following datasets.

In [None]:
company_growth_1 = [10.0, 8.0,  13.0,  9.0,  11.0, 14.0, 6.0,  4.0,  12.0,  7.0,  5.0]
pe_ratio_1 = [8.04, 6.95, 7.58,  8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

company_growth_2 = [10.0, 8.0,  13.0,  9.0,  11.0, 14.0, 6.0,  4.0,  12.0,  7.0,  5.0]
pe_ratio_2 = [9.14, 8.14, 8.74,  8.77, 9.26, 8.10, 6.13, 3.10, 9.13,  7.26, 4.74]

company_growth_3 = [10.0, 8.0,  13.0,  9.0,  11.0, 14.0, 6.0,  4.0,  12.0,  7.0,  5.0]
pe_ratio_3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15,  6.42, 5.73]

company_growth_4 = [8.0,  8.0,  8.0,   8.0,  8.0,  8.0,  8.0,  19.0,  8.0,  8.0,  8.0]
pe_ratio_4 = [6.58, 5.76, 7.71,  8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

> Let's perform a quick regression analysis.

In [None]:
datasets = [
    (company_growth_1, pe_ratio_1),
    (company_growth_2, pe_ratio_2),
    (company_growth_3, pe_ratio_3),
    (company_growth_4, pe_ratio_4),
]

# Fit a simple linear regression to each dataset and print the
# slope, intercept, correlation (r), p-value, and standard error of the slope.
for growth, pe_ratio in datasets:
    slope, intercept, r_value, p_value, std_err = linregress(growth, pe_ratio)
    print("{0:.3f} {1:.2f} {2:.3f} {3:.3f} {4:.3f}".format(slope, intercept, r_value, p_value, std_err))

> What makes these four datasets interesting is that their regression lines are practically the same, because all four share nearly identical summary statistics.

> That is, to about two decimal places, they have the same mean of X, the same variance of X, the same mean of Y, the same variance of Y, the same correlation, the same slope of the line, and the same intercept of the line.
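> You can check this claim yourself. The sketch below repeats the four datasets so that it runs on its own, and computes each statistic with NumPy only; note that `np.polyfit` is used here as a stand-in for `linregress` to obtain the slope and intercept, and the names `quartet` and `summary` are introduced just for this example.

```python
import numpy as np

# The four datasets from above, repeated so this cell is self-contained.
x_123 = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
quartet = {
    1: (x_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    2: (x_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    3: (x_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    4: ([8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 8.0, 19.0, 8.0, 8.0, 8.0],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (growth, pe_ratio) in quartet.items():
    x = np.array(growth)
    y = np.array(pe_ratio)
    # Degree-1 polynomial fit = least-squares straight line (stand-in for linregress).
    slope, intercept = np.polyfit(x, y, 1)
    summary[name] = {
        "mean_x": x.mean(),
        "var_x": x.var(ddof=1),   # sample variance
        "mean_y": y.mean(),
        "var_y": y.var(ddof=1),
        "corr": np.corrcoef(x, y)[0, 1],
        "slope": slope,
        "intercept": intercept,
    }
    print(name, {key: round(float(value), 2) for key, value in summary[name].items()})
```

> The four printed rows should agree once rounded to two decimal places, even though the underlying values differ in the later digits.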

> Now, let's visualize your data, along with the regression modeling results.

In [None]:
xmax = 20
ymax = 20

# suptitle() titles the whole figure; title() would only title a single plotting area.
plt.suptitle('Data Visualization Helps You Turn Data into Insights', fontname="Comic Sans MS", fontsize=20)

ax1 = subplot(2, 2, 1)
scatter(company_growth_1, pe_ratio_1)
slope, intercept, r_value, p_value, std_err = linregress(company_growth_1, pe_ratio_1)
plot([0, xmax], [intercept, slope * xmax + intercept])
print("{0:.3f} {1:.2f} {2:.3f} {3:.3f} {4:.3f}".format(slope, intercept, r_value, p_value, std_err))

subplot(2, 2, 2, sharex=ax1, sharey=ax1)
scatter(company_growth_2, pe_ratio_2)
slope, intercept, r_value, p_value, std_err = linregress(company_growth_2, pe_ratio_2)
plot([0, xmax], [intercept, slope * xmax + intercept])
print("{0:.3f} {1:.2f} {2:.3f} {3:.3f} {4:.3f}".format(slope, intercept, r_value, p_value, std_err))

subplot(2, 2, 3, sharex=ax1, sharey=ax1)
scatter(company_growth_3, pe_ratio_3)
slope, intercept, r_value, p_value, std_err = linregress(company_growth_3, pe_ratio_3)
plot([0, xmax], [intercept, slope * xmax + intercept])
print("{0:.3f} {1:.2f} {2:.3f} {3:.3f} {4:.3f}".format(slope, intercept, r_value, p_value, std_err))

subplot(2, 2, 4, sharex=ax1, sharey=ax1)
scatter(company_growth_4, pe_ratio_4)
slope, intercept, r_value, p_value, std_err = linregress(company_growth_4, pe_ratio_4)
plot([0, xmax], [intercept, slope * xmax + intercept])
print("{0:.3f} {1:.2f} {2:.3f} {3:.3f} {4:.3f}".format(slope, intercept, r_value, p_value, std_err))

# The axes are shared, so these limits apply to all four panels.
axis([0, xmax, 0, ymax])
plt.show()

> The lesson here is simply that summary statistics don’t tell the whole story. 

> Imagine that we looked only at the simple linear regression output for these four datasets.

> Everything would look practically the same, yet the underlying data are very different.

> **Visualizing your data gives you a deeper understanding of quantitative information, revealing patterns that you might assume you know but have never actually seen: namely, `data visualization helps transform data into insights`.**

### Histogram

> A `histogram` is an efficient visualization tool `to examine how your data is distributed`: for example, whether it is roughly bell-shaped (normally distributed) and centered around the mean, skewed, or flat.

> Let's create one.

In [None]:
plt.hist(z, bins = 10, color ="purple")
plt.show()
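> The array `z` was drawn with `randint()`, so every value is equally likely and its histogram tends to look flat rather than bell-shaped. For contrast, here is a minimal sketch with normally distributed values; the array `returns` and its parameters (mean 0, standard deviation 0.02, 1000 observations) are hypothetical and do not come from real market data.

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(1)  # assumption: a fixed seed so the figure repeats on every run

# Hypothetical daily returns drawn from a normal distribution.
returns = np.random.normal(loc=0.0, scale=0.02, size=1000)

# hist() returns the count in each bin, the bin edges, and the drawn bars.
counts, edges, patches = plt.hist(returns, bins=20, color="purple")

# Mark the sample mean to see whether the bars are centered around it.
plt.axvline(returns.mean(), color="black", linestyle="--", label="Sample mean")
plt.legend()
plt.show()
```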

> Histograms can also be used to compare the distributions of multiple datasets. 

> Let's compare the performance of two different stocks to find out which stock has more fluctuation.

##### How would you create two arrays that contain 365 random integers?

In [None]:
company_a = np.random.randint(50, 200, 365)
company_b = np.random.randint(30, 250, 365)

In [None]:
plt.hist(company_a, bins = 50, alpha = 0.5, color = "blue")
plt.hist(company_b, bins = 50, alpha = 0.5, color = "orange")
plt.show()

> Again, you can add a legend here.

> A legend is useful when plotting multiple datasets, because it identifies which plot is associated with which dataset.

> To add a legend, pass the `label` argument to each plotting call.

> To display the legend on the plot, you can use the function `plt.legend()`.

In [None]:
plt.hist(company_a, bins = 80, alpha = 0.5, color = "purple", label = "Company A")
plt.hist(company_b, bins = 80, alpha = 0.5, color = "grey", label = "Company B")
plt.legend()
plt.show()
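> The visual impression of which stock fluctuates more can be backed by a number. Below is a minimal sketch that uses the sample standard deviation as the measure of fluctuation; that choice, the fixed seed, and the names `std_a` and `std_b` are assumptions made for this example.

```python
import numpy as np

np.random.seed(42)  # assumption: a fixed seed so the numbers repeat on every run

# Same construction as above: 365 random integer prices per company.
company_a = np.random.randint(50, 200, 365)
company_b = np.random.randint(30, 250, 365)

# Sample standard deviation (ddof=1): larger means more spread-out prices.
std_a = company_a.std(ddof=1)
std_b = company_b.std(ddof=1)

print(f"Company A: std = {std_a:.1f}, range = {company_a.min()} to {company_a.max()}")
print(f"Company B: std = {std_b:.1f}, range = {company_b.min()} to {company_b.max()}")
```

> Company B is drawn from a wider range of integers, so its standard deviation should come out larger, matching the wider histogram.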

### Bar Chart

In [None]:
tickers = ["AAPL", "MSFT", "GOOGL"]
values = [459.63, 208.90, 1504.63]

plt.figure(figsize=(7,5), dpi=100)

bars = plt.bar(tickers, values, color = ("red", "blue", "green"))

plt.show()

In [None]:
plt.figure(figsize=(10,8), dpi=100)

bars = plt.bar(tickers, values, color = ("red", "blue", "green"))

patterns = ["/", "O", "*"]

# Give each bar its own hatch pattern.
for bar, pattern in zip(bars, patterns):
    bar.set_hatch(pattern)

plt.savefig("barchart.png", dpi=300)

plt.show()

> `Thank you for working with the script :)`

In [None]:
exit()