# Data Visualization Basics

## Why do we need data visualization?

Data visualization is a fundamental part of a data scientist's toolkit. We primarily use it:
- To *explore* data
- To *communicate* data

Goals for this week:
- Concentrate on building skills that we will need to start exploring data on our own and to produce relevant visualizations.

# Matplotlib

Link: https://matplotlib.org/

Note: Matplotlib, like pandas, is not part of the Python standard library, so we have to install it first:

```python -m pip install matplotlib```

We will be using the ```matplotlib.pyplot``` module. ```pyplot``` maintains an internal state in which we can build a visualization step by step.



In [None]:
from matplotlib import pyplot as plt

In [None]:


years = [1950,1960,1970,1980,1990,2000,2010]
gdp = [300.2,543.2,1075.9,2862.5,5979.6,10289.7,14958.3]

# Create a line chart, 
# x-axis : years
# y-axis : gdp

plt.plot(years,gdp,color = 'red',marker = 'x',linestyle = 'solid')

# add a title 
plt.title("Nominal GDP")

# Add a label to the y-axis
plt.ylabel("Billions of $")
plt.show()

## Bar Charts

- A good choice when we want to show how *some quantity varies among some discrete set of items*.

Example :
### How many academy awards were won by each movie

In [None]:
movies = ["Annie Hall","Ben-Hur","Casablanca","Gandhi","West Side Story"]
num_of_oscars = [5,11,3,8,10]

# plot bars with
# x-coordinates [0,1,2,3,4] (bars are centered on these by default)
# heights [num_of_oscars]
plt.bar(range(len(movies)),num_of_oscars)

# add title
plt.title("My favourite Movies")

# label the y-axis
plt.ylabel("# of Academy Awards")
# Label x-axis with movie titles
plt.xticks(range(len(movies)),movies)

plt.show()

Another good use of a bar chart is plotting histograms of bucketed numeric values. This can help us visualize distributions.

### Example: Grade Distribution

In [None]:
from collections import Counter
grades = [83,95,91,87,70,0,85,82,100,67,73,77,0]


# Bucket grades by decile, but put 100 in with the 90s

histogram = Counter(min(grade//10*10,90)for grade in grades)

plt.bar([x + 5 for x in histogram.keys()],    # Shift bars right by 5
        histogram.values(),             # Give each bar its correct height
        10,                             # Give each bar a width of 10
        edgecolor = (0,0,0)             # Black edges for each bar
        )            

plt.axis([-5,105,0,5])                  # x-axis from -5 to 105
                                        # y-axis from 0 to 5

plt.xticks([10 * i for i in range(11)]) # x-axis labels at 0,10, ..., 100

plt.xlabel("Decile")
plt.ylabel("# of Students")
plt.title("Distribution of Exam Grades")
plt.show()

### Examining the code
- Notice the third argument to ```plt.bar```, which specifies the bar width.
- We also shifted the bars right by 5, so that, for example, the "10" bar (which corresponds to 10-20) has its center at 15.
- We also added a black edge to each bar to make them visually distinct
- The call to ```plt.axis``` indicates that we want the x-axis to range from -5 to 105 (to leave a little space on the left and right)
- The y-axis should range from 0 to 5
- Lastly, ```plt.xticks``` puts x-axis labels at 0,10,20, ..., 100.
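
To check what the bucketing expression produces before it reaches ```plt.bar```, it can help to print the counts on their own. The following is a minimal sketch that reuses the ```grades``` list from the example; the names ```bar_centers``` and ```bar_heights``` are introduced here only for illustration.

```python
from collections import Counter

grades = [83, 95, 91, 87, 70, 0, 85, 82, 100, 67, 73, 77, 0]

# grade // 10 * 10 rounds down to the nearest multiple of 10;
# min(..., 90) folds a perfect score of 100 into the 90s bucket.
histogram = Counter(min(grade // 10 * 10, 90) for grade in grades)

# Each bar is 10 wide, so adding 5 to the bucket's lower bound gives its center.
bar_centers = [bucket + 5 for bucket in sorted(histogram)]
bar_heights = [histogram[bucket] for bucket in sorted(histogram)]

print(sorted(histogram.items()))  # [(0, 2), (60, 1), (70, 3), (80, 4), (90, 3)]
print(bar_centers)                # [5, 65, 75, 85, 95]
```

The tallest bar has a height of 4, which is why a y-axis range of 0 to 5 leaves a little headroom.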

## Line Charts

Line charts are easy to make using ```plt.plot```. They are a good choice for showing trends.


In [None]:
variance = [1,2,4,8,16,32,64,128,256]
bias_squared = [256,128,64,43,16,18,4,2,1]
total_error = [x + y for x,y in zip(variance,bias_squared)]
xs = [i for i, _ in enumerate(variance)]


# We can make multiple calls to plt.plot
# to show multiple series on the same chart

plt.plot(xs,variance, 'g-', label = 'variance')
plt.plot(xs,bias_squared, 'r-.',label = 'bias squared')
plt.plot(xs,total_error, 'b:',label = 'total error')


# Because we've assigned labels for each series, 
# we can get a legend 
plt.legend(loc = 9)
plt.xlabel("Model Complexity")
plt.xticks([])
plt.title("The Bias-Variance Tradeoff")
plt.show()


## Scatterplots

A scatter plot is the right choice for visualizing the relationship between two paired sets of data. 


In [None]:
friends = [70,65,72,63,71,64,60,64,67]
minutes = [175,170,205,120,220,130,105,145,190]
labels = ['a','b','c','d','e','f','g','h','i']

plt.scatter(friends,minutes)

# label each point 
for label,friend_count,minute_count in zip(labels,friends,minutes):
  plt.annotate(label,
               xy=(friend_count,minute_count), # Put the label with its point
               xytext = (5,-5),
               textcoords = 'offset points'
               )
  
plt.title("Daily minutes vs Number of Friends")
plt.xlabel("# of friends")
plt.ylabel("Daily minutes spent on the site")
plt.show()

## Figures and Subplots

Plots in matplotlib reside within a ```Figure``` object. We can create a new figure with ```plt.figure```:

In [None]:
import matplotlib.pyplot as plt
fig = plt.figure()


Let's dive into ```plt.figure```. It has a number of options; notably, ```figsize``` guarantees that the figure has a certain size and aspect ratio if it is saved to disk.
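
As a minimal sketch of the ```figsize``` option (the 8 x 4 inch size and the ```dpi``` value are arbitrary choices for illustration), we can create a figure and read its size back:

```python
import matplotlib.pyplot as plt

# figsize is (width, height) in inches; dpi is dots per inch.
fig = plt.figure(figsize=(8, 4), dpi=100)

# get_size_inches returns the width and height as an array of two numbers.
width, height = (float(v) for v in fig.get_size_inches())
aspect_ratio = width / height

print(width, height)   # 8.0 4.0
print(aspect_ratio)    # 2.0
```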

**Note**: In Colab/Jupyter nothing will be shown until a few more commands are entered. We cannot make a plot with a blank figure, so we create one or more subplots using ```add_subplot```:

In [None]:
ax1 = fig.add_subplot(2,2,1)

The above code means that our figure should be a ```2x2``` grid, which holds up to 4 plots in total, and that we have selected the 1st of the 4.

If we create the next two subplots, we end up with the following figure:

In [None]:
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,3)

In [None]:
fig

**Next**, we will run all the commands together, with one slight change: we also draw a plot.

In [None]:
import numpy as np
fig = plt.figure()
ax1 = fig.add_subplot(2,2,1)
ax2 = fig.add_subplot(2,2,2)
ax3 = fig.add_subplot(2,2,3)

plt.plot(np.random.randn(50).cumsum())

**Notice Something?**


We did not assign the plot to a particular subplot, so matplotlib automatically drew it on the last subplot that was created.
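
To avoid relying on which subplot happens to be current, we can call the plotting method on the axes object itself. The following is a minimal sketch (the data is random, so the curves differ on every run). It also uses ```plt.subplots```, a pyplot convenience function that creates the figure and the grid of axes in one call; the variable names ```fig2``` and ```axes``` are chosen here for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure()
ax1 = fig.add_subplot(2, 2, 1)
ax2 = fig.add_subplot(2, 2, 2)
ax3 = fig.add_subplot(2, 2, 3)

# plt.plot draws on the "current" axes, i.e. the one created last (ax3).
plt.plot(np.random.randn(50).cumsum())
current_is_ax3 = plt.gca() is ax3

# Calling the method on an axes object makes the target explicit.
ax1.plot(np.random.randn(50).cumsum(), "k--")

# plt.subplots creates the figure and a 2x2 array of axes in one call.
fig2, axes = plt.subplots(2, 2)
axes[0, 1].plot(np.random.randn(50).cumsum())
```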



In [None]:
ax1.hist(np.random.rand(100),bins=20,color = 'k',alpha = 0.3)
ax2.scatter(np.arange(30),np.arange(30)+3 * np.random.randn(30))

fig

## Colors, Markers, and Line Styles

By now you should have realized that the main function you are using in Matplotlib is the ```plot``` function.
The function accepts arrays of x and y coordinates, followed by optional arguments such as color and line style. For example, a green dashed line can be requested with a short format string:

```ax.plot(x,y,'g--')```

We can show the same plot more explicitly by adding a linestyle:

```ax.plot(x,y,linestyle='--',color='g')```
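
The two spellings can be compared directly. This is a minimal sketch with made-up data (```x``` and ```y``` are placeholders); it reads the style back from the line objects that ```plot``` returns:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(10)
y = x ** 2

fig, ax = plt.subplots()

# Short form: 'g--' means green ('g') with a dashed line ('--').
(short_line,) = ax.plot(x, y, "g--")

# Explicit form: the same style spelled out with keyword arguments.
# The data is shifted by 10 so that both lines stay visible.
(explicit_line,) = ax.plot(x, y + 10, linestyle="--", color="g")

same_color = short_line.get_color() == explicit_line.get_color()
same_style = short_line.get_linestyle() == explicit_line.get_linestyle()
```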

Line plots can also have *markers* to highlight the actual data points.

Because matplotlib draws a continuous line plot (interpolating between points), it can occasionally be unclear where the points lie. Markers make the actual data points, and therefore the interpolation between them, easier to see.

In [None]:
import numpy as np

fig = plt.figure()
ax1 = fig.add_subplot(2,1,1)
ax2 = fig.add_subplot(2,1,2)

plt.plot(np.random.randn(30).cumsum(),'bo--') # Plot 2: drawn on the current subplot (ax2)
# The same style written out explicitly (on fresh random data):
ax1.plot(np.random.randn(30).cumsum(),color = 'b',linestyle = '--',marker = 'o') # Plot 1



For line plots, we can notice that the points are interpolated linearly by default.

We can change this by using the ```drawstyle``` option.

In [None]:
data = np.random.randn(30).cumsum()

plt.plot(data,'k--',label = 'Default')
plt.plot(data,'k--',drawstyle = 'steps-post',label = 'steps-post')
plt.legend()

## Ticks, Labels and Legends

The ```pyplot``` interface, designed for interactive use, consists of methods like:
- ```xlim```
- ```xticks```
- ```xticklabels``` (in current matplotlib versions this is not a separate ```pyplot``` function; tick labels are passed as the second argument to ```plt.xticks```)

These methods control the plot range, tick locations and tick labels.
They can be used in two ways:
1. Called with no arguments, a method returns the current parameter value (e.g., ```plt.xlim()``` returns the current x-axis plotting range).
2. Called with parameters, it sets the parameter value (e.g., ```plt.xlim([0,10])``` sets the x-axis range from 0 to 10).
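
The two calling styles can be sketched as follows, using a small made-up line plot; the tick labels "start", "middle" and "end" are placeholders for illustration:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 5, 10, 15], [1, 3, 2, 4])

# 1. No arguments: return the current x-axis range as (left, right).
before = plt.xlim()

# 2. With arguments: set the x-axis range.
plt.xlim([0, 10])
after = plt.xlim()

# plt.xticks follows the same pattern; a second list sets the tick labels.
plt.xticks([0, 5, 10], ["start", "middle", "end"])
tick_positions = [float(t) for t in ax.get_xticks()]
```

Since these functions act on the current axes, the same changes are visible through the ```ax``` object afterwards.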

**Setting the title, axis labels, ticks and ticklabels**

Let's look at the plot below.

In [None]:
fig = plt.figure()
ax = fig.add_subplot(1,1,1)
ax.plot(np.random.randn(1000).cumsum())

**Next**, we are going to change the x-axis ticks. It's easiest to use ```set_xticks``` and ```set_xticklabels```. The former tells matplotlib where to place the ticks along the data range; by default these locations are also used as the labels. However, we can set any other values as the labels using ```set_xticklabels```:

In [None]:
ticks = ax.set_xticks([0,250,500,750,1000])
labels = ax.set_xticklabels(['one','two','three','four','five'],
                            rotation = 30,
                            fontsize = 'small')

#fig

The ```rotation``` option sets the x-tick labels at a 30-degree rotation.
Lastly, ```set_xlabel``` gives a name to the x-axis and ```set_title``` sets the subplot title.

In [None]:
ax.set_title('My first matplotlib plot')
#fig


In [None]:
ax.set_xlabel('Stages')
fig

In [None]:
# Bonus: we can also write this as:
props = {
    'title': 'This is my matplotlib plot',
    'xlabel': 'All stages'
}
ax.set(**props)
fig

**Adding Legends**

Legends are an important element for identifying the elements of a plot. The easiest way to add one is to pass the ```label``` argument when adding each piece of the plot:

In [None]:
from numpy.random import randn

fig = plt.figure()
ax=fig.add_subplot(1,1,1)
ax.plot(randn(1000).cumsum(),
        'b',
        label ='one')
ax.plot(randn(1000).cumsum(),
        'r--',
        label = 'two')
ax.plot(randn(1000).cumsum(),
        'g.',
        label = 'three')



In [None]:
ax.legend(loc = 'best') # The loc argument tells matplotlib where to place the legend.
                        # If you're not picky, 'best' works.
fig

**Annotations and Drawing on a Subplot**

In addition to standard plot types, we may wish to draw our own plot annotations. These can consist of text, arrows or other shapes. We can add annotations and text using the ```text```, ```arrow``` and ```annotate``` functions. ```text``` draws text at given coordinates (x,y) on the plot with optional custom styling:

```ax.text(x,y,'Hello world!',family='monospace',fontsize=10)```
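
A minimal sketch of ```text``` and ```annotate``` on a made-up curve follows; the coordinates and the "x = 7" label are arbitrary choices for illustration:

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
xs = list(range(10))
ys = [x ** 2 for x in xs]
ax.plot(xs, ys, "k-")

# text: place a string at data coordinates (x, y).
note = ax.text(1, 60, "Hello world!", family="monospace", fontsize=10)

# annotate: a label plus an arrow pointing from xytext to the data point xy.
point_label = ax.annotate("x = 7",
                          xy=(7, 49),        # the point being annotated
                          xytext=(3, 70),    # where the label text is drawn
                          arrowprops=dict(arrowstyle="->"))
```

Both calls return the text object they create, so its styling can still be adjusted afterwards.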


**Saving plots to File**

We can save the active figure to file using ```plt.savefig```. This method is equivalent to the figure object's ```savefig``` instance method.

Examples:

```plt.savefig('figpath.svg')```

*Note: The file type is inferred from the file extension. So, for example, if we used .pdf instead, we would get a PDF.*

Some important options for publishing graphics are:
- ```dpi``` : controls the dots-per-inch resolution.
- ```bbox_inches``` : Trims the whitespace around the actual figure.

```plt.savefig('figpath.svg',dpi = 400,bbox_inches = 'tight')```
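
As a runnable sketch of these options, the example below writes the same figure twice into a temporary directory. The directory and the plotted data are assumptions made for illustration; in practice you would use a path of your own.

```python
import os
import tempfile
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

# Scratch directory used only for this example.
out_dir = tempfile.mkdtemp()
svg_path = os.path.join(out_dir, "figpath.svg")
png_path = os.path.join(out_dir, "figpath.png")

# The format is inferred from the extension: .svg gives SVG, .png gives PNG.
fig.savefig(svg_path)

# dpi sets the resolution; bbox_inches='tight' trims surrounding whitespace.
fig.savefig(png_path, dpi=400, bbox_inches="tight")

saved = sorted(os.listdir(out_dir))
print(saved)  # ['figpath.png', 'figpath.svg']
```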


# Pandas

Matplotlib is a fairly low-level tool: we assemble a plot from its base components, e.g.:
- Type of plot
  - line
  - bar
  - box
  - scatter
  - contour, etc.
- legend
- title
- tick labels

In pandas we may have multiple columns of data, along with row and column labels. pandas has built-in methods that simplify creating visualizations from DataFrame and Series objects.

Another library is ```seaborn``` which is a statistical graphics library.
We will be using seaborn in the latter half of our course.

For now, let's plot using pandas.

In [None]:
import pandas as pd
import numpy as np

## Line Plots


### Series

In [None]:
s = pd.Series(np.random.randn(10),index = np.arange(0,100,10))
s.plot()

In the plot above, these are the important points to observe:
- The Series object's index is passed to matplotlib for plotting on the x-axis, though we can disable this by passing ```use_index = False```.
- The x-axis and y-axis properties can be modified by using ```xticks```, ```xlim```, ```yticks``` and ```ylim```.
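
These options can be sketched as follows. The tick positions and y-range are arbitrary values chosen for illustration, and passing an explicit ```ax``` is optional; it is done here only to make the target axes unambiguous.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

s = pd.Series(np.random.randn(10), index=np.arange(0, 100, 10))

fig, ax = plt.subplots()

# use_index=False plots against positions 0..9 instead of the index 0..90.
s.plot(ax=ax,
       use_index=False,
       xticks=[0, 3, 6, 9],
       ylim=(-4, 4))

tick_positions = [float(t) for t in ax.get_xticks()]
y_range = tuple(float(v) for v in ax.get_ylim())
```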



### DataFrame

In [None]:
df = pd.DataFrame(np.random.randn(10,4),
                  columns=['A','B','C','D'],
                  index = np.arange(0,100,10)
                  )

df.plot()

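The ```plot``` attribute contains a "family" of methods for different plot types, for example ```df.plot.bar()``` and ```df.plot.hist()```; ```df.plot()``` itself is equivalent to ```df.plot.line()```.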
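
As a minimal sketch of this family of methods, the example below draws the same random DataFrame as a bar chart in two equivalent ways; the variable names ```ax_bar``` and ```ax_same``` are introduced here for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10, 4),
                  columns=["A", "B", "C", "D"],
                  index=np.arange(0, 100, 10))

# Each plot type is available as a method of the plot accessor ...
ax_bar = df.plot.bar()          # one group of 4 bars per row

# ... or through the kind argument of df.plot.
ax_same = df.plot(kind="bar")   # equivalent spelling

# 10 rows x 4 columns gives 40 bars in each chart.
n_bars = len(ax_bar.patches)
```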

## Connecting to Your Google Drive


In [None]:
# Start by mounting Google Drive in Google Colab

from google.colab import drive

drive.mount('/content/gdrive')

In [None]:
!ls "/content/gdrive/My Drive/DigitalHistory"

In [None]:
cd "/content/gdrive/My Drive/DigitalHistory/Week_3"


In [None]:
ls

# Mapping the California Housing

## Import Libraries and unpack file



In [None]:
import pandas as pd
import zipfile
import numpy as np

In [None]:
import os
import tarfile
import urllib.request  # "import urllib" alone does not guarantee that urllib.request is loaded

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    """Download housing.tgz and extract its contents into housing_path."""
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    # The with-block closes the archive even if extraction fails
    with tarfile.open(tgz_path) as housing_tgz:
        housing_tgz.extractall(path=housing_path)

In [None]:
fetch_housing_data()

In [None]:
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

In [None]:
housing = load_housing_data()
housing.head()

## Visualizing the DataFrames

In [None]:
import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(20,20))

plt.show()

In [None]:
housing["median_income"].hist()

In [None]:
housing['ocean_proximity'].hist()

## Mapping Geographical Data

### Step 1


In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude")


By plotting longitude against latitude we can clearly see that it's California. However, it's almost impossible to see any particular pattern.
The next step is to distinguish the high-density areas from the low-density ones.

For this we will use the ```alpha``` option of the plot function, which controls the transparency of each point. We will set alpha to 0.1, so that areas where many points overlap appear darker.

### Step 2


In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)


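You can now see the high-density areas, for example the Bay Area, Los Angeles, San Diego, and to a lesser extent the Central Valley.

Now we have a pattern, but it's not yet very useful to us. So let's play with the visualization a little.

Note: Apart from the pandas ```plot``` method, we only need ```matplotlib```'s ```pyplot``` module for our visualization.

What we will do is add the following parameters:
- **s**: the marker size (here the ```population``` column divided by 100)
- **c**: the values used to color the markers (here the median house value)
- **cmap**: the colormap that turns those values into colors (here ```jet```)
- **colorbar**: whether to draw the color scale next to the plot

Other add-ons are:
- **label**: the name shown for the markers in the legend
- **figsize**: the size of the figure in inches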


### Step 3

In [None]:
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=housing["population"]/100, label="population", figsize=(10,7),
    c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True)
plt.legend()



#colormap

### Step 4

In [None]:

import matplotlib.image as mpimg
california_img=mpimg.imread('images/california.png')
ax = housing.plot(kind="scatter",
                  x="longitude",
                  y="latitude",
                  figsize=(10,7),
                  s=housing['population']/100,
                  label="Population",
                  c="median_house_value",
                  cmap=plt.get_cmap("jet"),
                  colorbar=False,
                  alpha=0.4)
plt.imshow(california_img, extent=[-124.55, -113.80, 32.45, 42.05], alpha=0.5,
           cmap=plt.get_cmap("jet"))
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)

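prices = housing["median_house_value"]
tick_values = np.linspace(prices.min(), prices.max(), 11)
# plt.colorbar() attaches to the most recent image (the map), whose color scale
# runs from roughly 0 to 1, so the 11 tick positions are given as fractions of
# the maximum price. Without fixed ticks, the 11 labels below would not line up.
cbar = plt.colorbar(ticks=tick_values/prices.max())
cbar.ax.set_yticklabels(["$%dk"%(round(v/1000)) for v in tick_values], fontsize=14)
cbar.set_label('Median House Value', fontsize=16)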

plt.legend(fontsize=16)


# Extras [FOR NOW]



In [None]:
fig = plt.figure()
ax = fig.add_subplot(1,1,1)

rect = plt.Rectangle((0.2,0.75),0.4,0.15,color = 'k',
                     alpha = 0.3)
circ = plt.Circle((0.7,0.2),0.15,color = 'b',alpha = 0.3)
pgon = plt.Polygon([[0.15,0.15],[0.35,.4],[0.2,0.6]],color = 'g',alpha = 0.3)

ax.add_patch(rect)
ax.add_patch(circ)
ax.add_patch(pgon)