# <center>Welcome to Intro to Natural Language Processing</center>
# <center>and Machine Learning in Python!</center>

# <center>Please go to ccv.jupyter.brown.edu</center>

# <center> What you have learned so far ... </center>
## Coding in general:
### - variables and control flow
### - common container types
### - functions and list comprehensions


# <center>Packages in Python </center>
## Packages make your life easy
#### - e.g., numpy, pandas, matplotlib, scikit-learn
#### - you don't need to write a function to calculate the mean of some numbers; call np.mean() instead
#### - written by software engineers, thoroughly tested by users
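
As a small illustration of the np.mean() point above, the sketch below compares a hand-written mean with the package call. The helper name `my_mean` and the sample numbers are made up for this example.

```python
import numpy as np

def my_mean(numbers):
    # hand-written version: add everything up, then divide by the count
    total = 0
    for number in numbers:
        total += number
    return total / len(numbers)

numbers = [2, 4, 6, 8]  # made-up sample values

print(my_mean(numbers))  # our own function
print(np.mean(numbers))  # one call to the package
```

Both lines print the same value; the package version is shorter and has already been tested by many users.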

# <center>Packages in Python </center>
## Drawbacks of packages:
#### - ALWAYS ALWAYS carefully read the help or the manual!
#### - you need to know what you are doing
## Only use a function if you understand it well enough to write it yourself *given enough time*.

## Almost every coding task you want to do has already been done by someone else!
#### - if you are not sure which package or function to use, search for it online
#### - Stack Overflow is an awesome knowledge base for coding problems

## <center> Visualization with matplotlib </center>
### By the end of this and the next session, you will be able to
#### - prepare basic plots using matplotlib (line, scatter, bar, histogram, heatmap)
#### - identify and fix common errors that occur while plotting
#### - design effective plots to deliver your intended message

# <center> Let's generate some data to plot first! </center>
## Question:
## How many documents contain 0-100, 100-200, 200-300, ... words?

In [None]:
import numpy as np
np.random.seed(0)

# simulated word counts
nr_words = np.random.lognormal(6, 1, 1000).astype(int)

# print() already puts a space between its arguments
print('The shortest document has', np.min(nr_words), 'words.')
print('The longest document has', np.max(nr_words), 'words.')

print(nr_words)

In [None]:
# let's bin these numbers
bin_edges = np.arange(0,10000,100)

print(bin_edges)

# one counter per bin; integer dtype so the counts print as 12 rather than 12.0
nr_docs_in_bins = np.zeros(len(bin_edges)-1, dtype=int)

for i in range(len(nr_docs_in_bins)):
    nr_docs_in_bins[i] = sum([1 for nr_word in nr_words if (nr_word > bin_edges[i])&(nr_word < bin_edges[i+1])])
    
    # let's print the results
    print('There are',nr_docs_in_bins[i],'documents that contain',bin_edges[i],'to',bin_edges[i+1],'words.')


### Exercise 1
#### How would you test that the code works properly?

In [None]:
bin_edges = np.arange(0,10000,100)

nr_docs_in_bins = np.zeros(len(bin_edges)-1)

for i in range(len(nr_docs_in_bins)):
    nr_docs_in_bins[i] = sum([1 for nr_word in nr_words if (nr_word > bin_edges[i])&(nr_word < bin_edges[i+1])])

    # let's print the results
    #print('There are',nr_docs_in_bins[i],'documents that contain'\
    #      ,bin_edges[i],'to',bin_edges[i+1],'words.')

# add test here


## Exercise 2
### Why aren't all the documents counted?
### What's wrong with the code?


In [None]:
bin_edges = np.arange(0,10000,100)

nr_docs_in_bins = np.zeros(len(bin_edges)-1)

for i in range(len(nr_docs_in_bins)):
    nr_docs_in_bins[i] = sum([1 for nr_word in nr_words if (nr_word >= bin_edges[i])&(nr_word < bin_edges[i+1])])

    # let's print the results
    #print('There are',nr_docs_in_bins[i],'documents that contain',bin_edges[i],'to',bin_edges[i+1],'words.')

# add test here
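
One possible way to test the binning, sketched under assumptions: run it on a tiny hand-made array where the right answer can be counted by eye, check that every document lands in exactly one bin, and compare against `np.histogram`. The names `toy_words`, `toy_edges`, and `count_docs` are made up for this example and are not part of the notebook above.

```python
import numpy as np

# tiny hand-made example: several values sit exactly on a bin edge
toy_words = np.array([0, 50, 100, 100, 250, 299, 300])
toy_edges = np.arange(0, 500, 100)  # 0, 100, 200, 300, 400

def count_docs(words, edges, include_left_edge):
    # same loop as above, with the comparison on the left edge switchable
    counts = np.zeros(len(edges) - 1, dtype=int)
    for i in range(len(counts)):
        if include_left_edge:
            in_bin = (words >= edges[i]) & (words < edges[i + 1])
        else:
            in_bin = (words > edges[i]) & (words < edges[i + 1])
        counts[i] = np.sum(in_bin)
    return counts

strict = count_docs(toy_words, toy_edges, include_left_edge=False)
fixed = count_docs(toy_words, toy_edges, include_left_edge=True)

# with '>' the values 0, 100, 100 and 300 fall between bins and are lost
print(strict, 'total:', strict.sum())
print(fixed, 'total:', fixed.sum())

# test 1: every document must be counted exactly once
assert fixed.sum() == len(toy_words)

# test 2: compare against a well-tested package function
reference, _ = np.histogram(toy_words, bins=toy_edges)
assert np.array_equal(fixed, reference)
```

Note that `np.histogram` treats the last bin as closed on the right, so the two results can differ if a value equals the final edge exactly.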

## <center> Simple line plot using plt.plot </center>
### Always read the help for a function carefully before you use it!
#### My experience is that most errors/bugs occur because I didn't read the help carefully enough.

In [None]:
import matplotlib.pyplot as plt
import matplotlib

help(plt.plot)

# <center> General matplotlib code structure </center>

In [None]:
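# import the plotting interface
import matplotlib.pyplot as plt

# note: x1, y1, x2, y2, x_low, x_high, y_low, y_high below are placeholders
# for your own data and axis limits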

# define your canvas
plt.figure(figsize=(8,6))

# draw on the figure, define the properties of the line or marker
plt.plot(x1,y1,'bo',markersize=12)
plt.plot(x2,y2,'--')

# modify the properties of the figure
# axis labels, title, x and y limits, etc.
plt.xlim([x_low,x_high])
plt.ylim([y_low,y_high])
plt.xlabel('the name of quantity x')
plt.ylabel('the name of quantity y')
plt.title('plot title')

# save the figure as a file, then show it in the notebook
# (call savefig before show: after show, the saved file can come out blank)
plt.savefig('figure.jpg',dpi=150)
plt.show()

In [None]:
import matplotlib.pyplot as plt

plt.figure(figsize=(8,6))
plt.plot(nr_docs_in_bins)
plt.show()

## Exercise 3
### How would you improve this plot?

In [None]:
# x should be the number of words, not the array index
# (this attempt raises a ValueError: bin_edges has one more element than nr_docs_in_bins)

plt.figure(figsize=(8,6))
plt.plot(bin_edges,nr_docs_in_bins)
plt.show()

In [None]:
plt.figure(figsize=(8,6))
bin_center = (bin_edges[1:] + bin_edges[:-1])/2
plt.plot(bin_center,nr_docs_in_bins)
plt.show()
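
Why use bin centers rather than bin edges on the x axis? A minimal sketch with made-up numbers (`edges`, `counts`, and `centers` are names chosen for this example): n edges define only n-1 bins, and plt.plot needs x and y arrays of the same length.

```python
import numpy as np

edges = np.arange(0, 500, 100)   # 5 edges: 0, 100, 200, 300, 400
counts = np.array([2, 2, 2, 1])  # 4 bins, made-up counts

# the edges array has one more element than the counts array
print(len(edges), len(counts))

# midpoint of each bin: the average of its left and right edge
centers = (edges[1:] + edges[:-1]) / 2
print(centers)

# centers and counts now have the same length, so they can be plotted together
print(len(centers) == len(counts))
```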

In [None]:
# add x and y labels, and a title
plt.figure(figsize=(8,6))
bin_center = (bin_edges[1:] + bin_edges[:-1])/2
plt.plot(bin_center,nr_docs_in_bins)
plt.xlabel('nr. of words in doc')
plt.ylabel('nr. of docs')
plt.title('histogram')
plt.show()

In [None]:
# increase font size
import matplotlib
matplotlib.rcParams.update({'font.size': 16})

plt.figure(figsize=(8,6))
bin_center = (bin_edges[1:] + bin_edges[:-1])/2
plt.plot(bin_center,nr_docs_in_bins)
plt.xlabel('nr. of words in doc')
plt.ylabel('nr. of docs')
plt.title('histogram')
plt.show()

In [None]:
# focus on the important area
matplotlib.rcParams.update({'font.size': 16})

plt.figure(figsize=(8,6))
bin_center = (bin_edges[1:] + bin_edges[:-1])/2
plt.plot(bin_center,nr_docs_in_bins)
plt.xlim([0,2000])
plt.xlabel('nr. of words in doc')
plt.ylabel('nr. of docs')
plt.title('histogram')
plt.show()

In [None]:
# plot the points and the lines

plt.figure(figsize=(8,6))
bin_center = (bin_edges[1:] + bin_edges[:-1])/2
plt.plot(bin_center,nr_docs_in_bins,'o')
plt.plot(bin_center,nr_docs_in_bins)
plt.xlim([0,2000])
plt.xlabel('nr. of words in doc')
plt.ylabel('nr. of docs')
plt.title('histogram')
plt.show()
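
As an alternative sketch, points and lines can also be drawn with a single plt.plot call by combining a marker and a line style in the format string. The arrays `x` and `y` below are made-up stand-ins for the bin centers and counts.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([50, 150, 250, 350])  # made-up bin centers
y = np.array([2, 2, 2, 1])         # made-up counts

plt.figure(figsize=(8,6))
# 'o-' means circle markers connected by a solid line
line, = plt.plot(x, y, 'o-')
plt.xlabel('nr. of words in doc')
plt.ylabel('nr. of docs')
plt.title('histogram')
plt.show()
```

With two separate plt.plot calls, the markers and the line each take the next color from the default color cycle; with one call they share the same color.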