# Overview of some of the Python packages.

***Disclaimer: I am using the words library, module, and package very loosely. To be honest, it is difficult to keep the distinctions entirely clear. Just be aware that I am not applying these terms by their strict definitions. For our purposes the distinctions are largely insignificant.***

So what is this module/package/library business? To be clear for once, let's lay the definitions out and then promptly forget them.
<dl>
    <dt>Module</dt>
    <dd>Simply a Python file that contains functions, global variables, etc. It is an easy way to store functions and tasks and reuse them across Python programs.</dd>
    <dt>Package</dt>
    <dd>A package is a collection of modules in a directory. In everyday use there is no real difference from a library: both are just more robust collections of modules and packages.</dd>
    <dt>Library</dt>
    <dd>The library is not a strict unit in Python, but the word is used to describe a container for multiple packages/modules in a directory that is often interfaced with through an API. No real difference from a package.</dd>
</dl>
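
If you want to see the module/package distinction for yourself, here is a small sketch using two standard-library imports. The `kind` helper is a hypothetical name of my own; it relies on the fact that packages carry a `__path__` attribute while plain modules do not.

```python
import json      # a standard-library package (a directory of modules)
import textwrap  # a standard-library module (a single .py file)


def kind(obj):
    """Return 'package' if the imported object has a __path__, else 'module'."""
    return "package" if hasattr(obj, "__path__") else "module"


print("json is a", kind(json))
print("textwrap is a", kind(textwrap))
```

Either way the `import` statement looks the same, which is why the terms blur together so easily.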

## Resources:
[Python Data Analysis Library](https://pandas.pydata.org/)
<br/>
[Pandas Documentation](http://pandas.pydata.org/pandas-docs/stable/)
<br/>A Python tool for creating and analysing data structures.
<br/>
[Numpy Documentation](http://www.numpy.org/)<br/>
NumPy is the fundamental package for scientific computing with Python. Basically NumPy is used for math and working with arrays.
<br/>
[Scikit Learn Documentation](http://scikit-learn.org/stable/index.html)<br/>
Python tools for basic machine learning, data mining, and data analysis.

## Numpy and Pandas


In [2]:
import pandas as pd
import numpy as np

%matplotlib inline

For example we can use numpy to generate a random array of integers and then use Pandas to put those in a data frame. 

In [None]:
values = np.random.randint(0,100,size=(60, 8))
values

In [None]:
# we can flatten the array into one dimension, sometimes useful in NLP.
r = np.ravel(values)
r

In [None]:
r.shape

In [None]:
# create a dataframe from the array
data = pd.DataFrame(values,columns=['A','B','C','D','E','F','G','H'])

In [None]:
data

Pandas and NumPy provide us with a myriad of ways to change, manipulate, analyze, and visualize data. 

In [None]:
data.describe()

In [None]:
data

### Visualizing the data

In [None]:
data.plot(kind='box')

In [None]:
data.plot(kind='kde')

In [None]:
data.plot(kind='scatter',x='A',y='B')

<hr style="height:2px"/>
# Scikit Learn

Machine learning is just an algorithm that a computer uses to make generalizations about data.

* [scikit-learn](http://scikit-learn.org/stable/)<br/>
This is an open source package built on several libraries: NumPy, SciPy, and matplotlib.

## The Machine Learning Process
1. Define the problem and identify the research question.
2. Find the data
3. Prepare the data (munging)
4. Select the algorithm
5. Train the model
6. Test the model
7. Communicate/share results

## Types of Machine Learning algorithms:
* Classification: assign some data to a class. Is an email spam or not?
* Regression: predict a value based on a set of features. Based on location and size, what is the price of a house?
* Clustering: looking for patterns in data, such as categorizing text.
* Rule Extraction: recommendation engines. Given a set of data and some rules we can make a prediction.

### Example of machine learning with classification
Is a flower a daisy or a rose?
The corpus is the set of data we have that gives us the information the algorithm will use to learn how to distinguish the characteristics of daisies and roses.

We need to choose an algorithm.

When the algorithm is chosen we train it on a set of training data from the corpus. E.g. white petals = daisy, red = rose.

The input vectors are the petal colors. The output we desire is the type of flower.
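
To make the train-then-predict idea concrete, here is a toy sketch in plain Python. It is not a scikit-learn algorithm: the `training_data` list, the `train` and `predict` helpers, and the "most common label per color" rule are all assumptions made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical training corpus: (petal color, flower type) pairs.
training_data = [
    ("white", "daisy"), ("white", "daisy"), ("red", "rose"),
    ("red", "rose"), ("white", "daisy"), ("red", "rose"),
]


def train(pairs):
    """Count which label appears with each color and keep the most common one."""
    counts = defaultdict(Counter)
    for color, flower in pairs:
        counts[color][flower] += 1
    return {color: c.most_common(1)[0][0] for color, c in counts.items()}


def predict(model, color):
    """Return the learned flower type, or None for a color never seen in training."""
    return model.get(color)


model = train(training_data)
print(predict(model, "white"))
print(predict(model, "red"))
print(predict(model, "blue"))
```

The last call shows a limit of learning from a corpus: the model has nothing to say about a petal color it never saw during training.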



## Supervised and Unsupervised learning
### supervised
These are algorithms whose training data has labels, which are used to correct and tune the model. For example, we have location and size data for a house and we want to predict the price. For the houses in our corpus we know the features (location and size) and the label (price), and we want to learn how to predict the price of a new house. Using the data corpus we can adjust the model so that it can predict a target variable (e.g. price).

This data set has a set of feature vectors or variables, the x variables. 

The attribute that we want to predict is called the label, or the y variable.
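
As a rough sketch of what "learning from labeled data" can look like, the code below fits a straight line from an x variable to a y variable using ordinary least squares. The house sizes and prices are invented numbers, and `fit_line` is a hypothetical helper, not part of any library.

```python
# Hypothetical houses: size in square feet (x, the feature) and price (y, the label).
sizes = [1000.0, 1500.0, 2000.0, 2500.0]
prices = [200000.0, 300000.0, 400000.0, 500000.0]


def fit_line(x, y):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance = sum((a - mean_x) ** 2 for a in x)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(sizes, prices)

# Use the tuned model to predict the label for a house that is not in the corpus.
predicted = slope * 1800.0 + intercept
print(round(predicted))
```

The labels are what make this supervised: the known prices tell the fitting step how far off each guess is.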


### unsupervised
The model is set up to learn the patterns and structures in the data when we do not know what the correlations might be. Given a large set of data these algorithms look for relationships that are not apparent.

We have no labels, we do not know what our y variable is, and we only have the x variables. From this data unsupervised algorithms look for patterns.

Clustering is a common type of unsupervised machine learning. It is a method of identifying groups of similar content after an analysis of a selected group of features.

## Data types
### Continuous

This is data that has an infinite set of values in a range. These are values like salary, height, weight, age.

### Categorical Data
This is data that has a limited number of values. These are values like Month, color, day of the week, yes/no.

This type of data cannot necessarily be ordered.


## Scaling Data
This is also called normalization. The purpose is to standardize the variability in feature values. The exam data below has student scores, and it is not clear whether they are scores out of 100, out of 120, etc. To work with this data we need to standardize it.
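
One common way to standardize is to subtract the mean and divide by the standard deviation, so every column ends up with mean 0 and standard deviation 1. Here is a minimal pure-Python sketch of that idea; the `scores` list is made-up data, `standardize` is a hypothetical helper, and the use of the population standard deviation is an assumption on my part.

```python
import statistics


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumes the values are not all identical
    return [(v - mean) / std for v in values]


scores = [72, 69, 90, 47, 76]  # hypothetical exam scores on an unknown scale
scaled = standardize(scores)

print([round(s, 2) for s in scaled])
print(round(statistics.fmean(scaled), 6), round(statistics.pstdev(scaled), 6))
```

After scaling it no longer matters whether the original scores were out of 100 or 120; each value only says how far a student is from the average, measured in standard deviations.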

## Feature Vectors

First, know that a vector is simply a list of numbers, e.g. (1,2,5,3). To visualize what this means for machine learning, it helps to think of a 2-dimensional vector that can be plotted on a Cartesian plane.

<div style="text-align:center">
    <img src="images/vector2D.png" style="margin-top:2em;width:40%"/>
</div>

Vectors can contain any number of values, but it is difficult for us to visualize more than two or three dimensions.

In machine learning each measurement we have is a feature. You can think of the features as the columns in a table. If we are trying to determine the cost of a house, some features we might consider are the number of bedrooms, the square footage, the number of bathrooms, etc. These data points are our features, and a collection of these data points for one house would be the feature vector for that house.

The key is to identify features that correlate with the target variable you want to predict. To continue the house example, including the weight of the home owner would probably not be a useful feature for determining the price of a house, while looking at the size of the house in square feet might be.

This is where domain knowledge is very valuable for identifying useful features. An environmental scientist could be better suited for identifying useful features when trying to predict climate change than a software developer.

However, another way to look at feature determination is to look at it from a mathematical point of view and see what features correlate mathematically to the target variable best. This is often a useful approach and one that different algorithms rely on as well.
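
One simple mathematical check is the Pearson correlation between a candidate feature and the target. The sketch below computes it by hand; all of the house numbers are invented for illustration and `pearson` is a hypothetical helper name.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


# Hypothetical data for five houses.
square_feet = [1000, 1500, 2000, 2500, 3000]
owner_weight = [70, 95, 60, 88, 74]
price = [210000, 290000, 405000, 480000, 610000]

print("square feet vs price: ", round(pearson(square_feet, price), 3))
print("owner weight vs price:", round(pearson(owner_weight, price), 3))
```

A value near 1 or -1 suggests a feature worth keeping, while a value near 0 suggests it carries little linear information about the target. Correlation only captures straight-line relationships, so it complements domain knowledge and does not replace it.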

The combination of these two approaches is often the most effective. 

## Data Science Exercise - feature extraction: let's make some features

Let's look at some student exam data and see how we can scale some of the variables and transform some of the data into useful features (aka **feature extraction**).

In [None]:
import sklearn

First we bring the data into the notebook and save it as a Pandas dataframe.

In [None]:
test_data = pd.read_csv('data/test.csv')
test_data

We don't know anything about this data except what we have in front of us in this instance. This may often be the case. Be careful to avoid assumptions. The score numbers look like they are percentages, but we have no way of confirming this. These could be points and we have no access to the scale. Scikit learn provides a means to standardize these data based on the mean and standard deviation.

In the cell below we will scale the data and assign the results back to the column.

In [None]:
from sklearn import preprocessing
test_data[['math score']] = preprocessing.scale(test_data[['math score']])
test_data[['reading score']] = preprocessing.scale(test_data[['reading score']])
test_data[['writing score']] = preprocessing.scale(test_data[['writing score']])

In [None]:
test_data

### One hot encoding for categorical data
This identifies a value and uses a 0 or 1 to indicate if it is part of a category or not. For a column with only two values we can do this easily with scikit learn's LabelEncoder, which replaces each distinct label in a column with an integer automatically. (With more than two values LabelEncoder produces 0, 1, 2, and so on, which is a label encoding rather than a true one hot encoding.)

Gender is a perfect column for this type of feature extraction. A computer can do nothing with the values "female/male" (e.g. what is "male" + "male"?). However, a computer can evaluate the number 1 or 0. So we can use 0 to indicate "female" and 1 to indicate "male".
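
To see what these encodings do under the hood, here is a minimal pure-Python sketch. The `genders` list is made-up sample data, not the exam file, and assigning integers in sorted label order is my assumption about how an encoder would number the classes.

```python
genders = ["female", "male", "male", "female", "male"]  # hypothetical column values

# Label encoding: every distinct label gets one integer, in sorted order.
classes = sorted(set(genders))
to_code = {label: code for code, label in enumerate(classes)}
encoded = [to_code[g] for g in genders]

# One hot encoding: one 0/1 column per class, with exactly one 1 in each row.
one_hot = [[1 if g == c else 0 for c in classes] for g in genders]

print(classes)
print(encoded)
print(one_hot)
```

With two classes the label encoding is itself a single 0/1 column. With more classes the one hot form avoids implying that class 2 is somehow "bigger" than class 1.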


In [None]:
# This lets the LabelEncoder learn which labels appear in the column.
le = preprocessing.LabelEncoder()
le.fit(test_data['gender'])

The classes reflect the labels of the original data.

In [None]:
le.classes_

Now we can use the LabelEncoder to replace the data in the gender column.

In [None]:
test_data['gender'] = le.transform(test_data['gender'])

In [None]:
test_data.head()

Pandas also has a great way to do this kind of "one hot encoding". 

In [None]:
# this will return a new dataframe of the one hot data.
pd.get_dummies(test_data['parental level of education'])

In [None]:
test_data.head()

Now we put this in our original test_data dataframe.

In [None]:
test_data = pd.get_dummies(test_data, columns=['parental level of education'])

In [None]:
test_data.head()

We can look at the "test preparation course" column and see what values we have there.

In [None]:
# Can this also be one hot encoded?
test_data['test preparation course'].value_counts()

Let's transform the rest of our columns to one hot encoding data so the computer can make sense of it. 

In [None]:
test_data = pd.get_dummies(test_data, columns=['lunch','test preparation course'])

In [None]:
test_data.head()

In [None]:
test_data.tail()

In [None]:
test_data.describe()

Now we have a collection of features that can be processed mathematically by a computer. 

### Data visualization and Exploration

If we plot out the data for the scores, we can see that there seems to be a strong correlation between reading score and writing score.

In [None]:
%matplotlib inline
test_data.plot(kind='scatter',x='reading score',y='math score')

In [None]:
test_data.plot(kind='scatter',x='reading score',y='writing score')

In [None]:
# a pie chart of a Series needs no x column; the label reminds us how gender was encoded
test_data['gender'].value_counts().plot(kind='pie', label="female = 0")



## Using numeric data to analyze text
So what does this mean for text data?

How can we use numeric data to analyze text? Let's think about this in the context of a sentiment analyzer for a review. Imagine our review of the movie is: "This was the best movie!"

We can imagine our `text = 'This was the best movie!'`

Well we can tokenize it to `tokens = ['This', 'was', 'the', 'best', 'movie']`

We could use the one hot method on the vocabulary so that if a sentence has certain words we can establish a correlation. For example, the `tokens` document would become something like `[('This',1),('was',1), ('the',1), ('best',1), ('movie',1)]`.

The problem with this is that it does not count frequency: each word in the vocabulary is reduced to a 0 or a 1. It also does not take word order into account. And as the corpus grows its vocabulary, the matrix becomes enormous and sparse. This is not a very intelligent way to represent the sentence information.
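
Here is a small sketch of that limitation in plain Python. The two `docs`, the simple regular-expression `tokenize` function, and the helper names are all assumptions for illustration. The second document repeats words so that the presence vector and the count vector differ.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


docs = ["This was the best movie!", "The best of the best."]

# The vocabulary is every distinct token in the corpus, in sorted order.
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})


def binary_vector(text):
    """One hot style vector: 1 if the vocabulary word is present, else 0."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]


def count_vector(text):
    """Count vector: how many times each vocabulary word occurs."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


print(vocabulary)
print(binary_vector(docs[1]))
print(count_vector(docs[1]))
```

The binary vector cannot tell that "best" was said twice, and neither vector remembers the order of the words.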

A better method is frequency based embeddings.

### Frequency based embeddings
These rely on count (how many times a word has occurred in a document) and Term Frequency - Inverse Document Frequency (TF-IDF).
#### Count
If we see that the word 'bad' occurs frequently in a review, then we can say the movie is probably reviewed negatively.

Problems arise here again:
* If the corpus is very large then again we have a huge set of feature vectors (the vocabulary). 
* We are still losing the context of the words.
* Semantics and relationships are not preserved

Some solutions to deal with a large vocabulary could be to limit the number of words we choose. We could rely on the top 50 words in our word count but we have seen before that this favors words that might not convey any useful information (e.g. "a", "the", "an"). To remedy this we may filter out stopwords that do not affect the sentiment of a review.

Another method is to hash words to buckets to reduce the vocabulary size. This means if we have 10,000 words we can create 8,000 buckets. We need to be careful to keep the number of collisions small, that is, cases where two different words map to the same bucket. A collision is not always harmful, though. For example, we could put the words "actor" and "actress" into one bucket if we are trying to evaluate whether a review is describing good or bad acting. These two words are now treated as one and counted as one. So the sentence "the actor was great and the actress was unknown" would get 2 counts for the "actor, actress" bucket. In our count we would never know that these are separate words, only that this bucket has a higher count than other buckets.
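
A bare-bones sketch of hashing words into buckets might look like the code below. It uses `zlib.crc32` purely as a stand-in hash function (not what scikit-learn uses), and `bucket_of` and `hashed_counts` are hypothetical helper names. With an arbitrary hash we cannot choose which words share a bucket, so collisions here are accidental, not hand-picked like "actor" and "actress".

```python
import zlib


def bucket_of(word, n_buckets):
    """Map a word to a bucket index; crc32 gives the same answer on every run."""
    return zlib.crc32(word.encode("utf-8")) % n_buckets


def hashed_counts(tokens, n_buckets):
    """Count tokens per bucket instead of per vocabulary word."""
    counts = [0] * n_buckets
    for token in tokens:
        counts[bucket_of(token, n_buckets)] += 1
    return counts


tokens = "the actor was great and the actress was unknown".split()
vector = hashed_counts(tokens, n_buckets=4)

print(vector)
for word in sorted(set(tokens)):
    print(word, "->", bucket_of(word, 4))
```

There are seven distinct words but only four buckets, so at least two words must collide. The vector stays a fixed length no matter how large the vocabulary grows, which is the whole point of the trick.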

#### TF-IDF (term frequency - inverse document frequency)
This is a measure of how often a word occurs in a document compared to how often it occurs in a corpus.

TF-IDF looks at how often a term appears in a single document. The more often a word appears in the document, the more likely it is to be considered important. This is weighed against how often the word appears in the corpus. So the logic is that a word that appears more frequently in a document but less frequently in the corpus is a significant term for that document.

This has the advantage of reducing the feature vector size and giving relevance to frequency in a document. However, we still have the drawback that the context of the word is not captured.
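
Here is a minimal sketch of the textbook calculation, where the score is the term frequency multiplied by log(number of documents / number of documents containing the term). The three tiny `docs` are invented, `tf_idf` is a hypothetical helper, and library implementations typically use smoothed or normalized variants, so their numbers will differ.

```python
import math
from collections import Counter

# Hypothetical corpus of three already tokenized reviews.
docs = [
    ["the", "movie", "was", "bad"],
    ["the", "movie", "was", "great"],
    ["the", "plot", "was", "bad", "bad"],
]


def tf_idf(docs):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


scores = tf_idf(docs)
for doc_scores in scores:
    print({term: round(score, 3) for term, score in doc_scores.items()})
```

Words such as "the" and "was" appear in every document, so their score drops to zero, while a word that is rare in the corpus, such as "great", stands out in the one document that uses it.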



## Vectorizing data (aka turning text into numbers)
We will start with a very small corpus of data so we can see what is going on.

In [None]:
# a small corpus of five documents.
corpus = ['The first wish I have is for a dog. I love dogs.',
         'The second wish I have is for a cat.',
         'Is the third wish on my wish list a polar bear? Is it?',
         'Number four of the things I want is a cow. Number four, cow. Yup.',
         'The last thing I ever want is a hot dog']

Try using the Notebook's `shift + tab` feature to learn more about how the CountVectorizer works. Also note we are taking it from a submodule of Scikit Learn called `feature_extraction.text`. This submodule provides tools for building feature vectors from text documents.

In [None]:
# This is a simple frequency based vectorizer
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# set up the vectorizer
vectorizer = CountVectorizer()
vectorizer.fit(corpus)
word_count = vectorizer.transform(corpus)

Note we have created a sparse matrix. This is just a matrix that contains a lot of empty, or 0, values, which is often the case when working with word counts in a corpus.

In [None]:
word_count

In [None]:
# this will show the matrix broken up by document, word represented by an integer and word count.
# Note that the document index counts from 0, and the words within a document are not listed in sentence order.

print(word_count)

<p>If we take one of these lines, <code>(4, 26)     1</code>, 4 is the document index (counting from 0, so this is the fifth document in our corpus), 26 is the word id, and the trailing 1 is the count. </p>
<p>Let's take an even closer look at what is in this matrix.</p>


In [None]:
# we can see the id of a word:
vectorizer.vocabulary_.get('wish')

In [None]:
# look at the entire vocabulary and the ids
vectorizer.vocabulary_

Notice 'dog' and 'dogs' are counted separately.

In [None]:
# get a list of the feature names.
# scikit-learn >= 1.0; on older versions use get_feature_names() instead
vectorizer.get_feature_names_out()


Now let's look at this information as data in a data frame using pandas.

In [None]:
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
corpus_data = pd.DataFrame(word_count.toarray(), columns=vectorizer.get_feature_names_out())
corpus_data

Now we could do some data exploration with pandas if we wanted.

We don't have enough data here, but if we had several hundred more rows we might want to use unsupervised learning to look for patterns in these "documents". Hopefully we would see the algorithm group them into wishes for dogs, wishes for cats, and wishes for cows or bears.

### TF-IDF Vectorizer

Now let's work with the TF-IDF vectorizer.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
tfidf.fit(corpus)
word_tfidf = tfidf.transform(corpus)

In [None]:
# Here we will see a similar matrix.
# In this case each vocabulary word gets a tf-idf score instead of a count.
print(word_tfidf)

We can access this matrix in much the same way we accessed the CountVectorizer matrix.

In [None]:
tfidf.vocabulary_

In [None]:
# get_feature_names_out() needs scikit-learn >= 1.0; older versions use get_feature_names()
corpus_data = pd.DataFrame(word_tfidf.toarray(), columns=tfidf.get_feature_names_out())
corpus_data

We do not have a large vocabulary, but imagine we do. If we wanted to reduce the feature count, we could hash the features. Scikit Learn also gives us a hashing tool.

In [None]:
from sklearn.feature_extraction.text import HashingVectorizer

# here n_features is the number of buckets
hashing = HashingVectorizer(n_features=18)
word_hash = hashing.fit_transform(corpus)
print(word_hash)

Notice that where the word ids previously ran across the whole vocabulary, now we only have 18 columns (0-17), and the frequency is not represented as a count but as a normalized value. One downside to hashing is that we have lost the link back to the vocabulary, so tracing a column back to its words is not possible.


In [None]:
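# word_hash is a plain sparse matrix and the HashingVectorizer stores no vocabulary,
# so there are no feature names to look up (word_hash.get_feature_names() raises an AttributeError).
hasattr(hashing, 'vocabulary_')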

In [None]:
word_hash.get_shape()

In [None]:
hash_data = pd.DataFrame(word_hash.toarray())

In [None]:
hash_data