# Visualizing PCA, NMF, tf-idf

### Dimension Reduction
* Dimension reduction finds patterns in data and uses these patterns to re-express it in a compressed form
* More efficient storage and computation, a big deal in a world of big data sets
* Removes less-informative "noise" features, which cause problems for prediction tasks
* In many real-world scenarios, it's dimension reduction that makes prediction possible.
* PCA is the most fundamental of dimension reduction techniques

### PCA = Principal Component Analysis
* Fundamental dimension reduction technique
* Performs dimension reduction in two steps
    1) __"Decorrelation"__ (doesn't reduce the dimension at all)
    2) __Dimension Reduction__
    
#### 1) Decorrelation:
* In this first step, PCA rotates data samples to be aligned with axes
* PCA shifts data samples so they have mean 0
* Note that no information is lost

* PCA follows the fit/transform pattern
* `PCA` is a scikit-learn component like `KMeans` or `StandardScaler`
* `fit()` learns the transformation from the given data (how to shift and how to rotate the samples) but doesn't actually change the samples
* `transform()` applies the transformation that `fit()` learned

```
from sklearn.decomposition import PCA
model = PCA()
model.fit(samples)
transformed = model.transform(samples)
```
* This returns a new array of transformed samples
* This new array has the same number of rows and columns as the original sample array. In particular there is one row for each transformed sample.
* Columns of the new array correspond to PCA features 
* __PCA features are not correlated__
    * PCA, due to the rotation (/alignment along the axis) of the data, de-correlates it.
* "Principal Components" = "directions of variance"
* PCA aligns principal components with the axes
* After a PCA model has been fit, the principal components are available as the `.components_` attribute. This is a numpy array with one row for each principal component.
   
    
* Linear correlation can be measured with __Pearson Correlation Coefficient__
    * Values between -1 and 1 
    * Value of 0 means no linear correlation 
    * The closer the value to either 1 or -1, the higher the correlation
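
The shift-then-rotate idea can be sketched with NumPy alone. This is only an illustration under stated assumptions: the data is randomly generated (it is not the `grains` array), and the eigen-decomposition of the covariance matrix stands in for whatever scikit-learn does internally, so signs and numerical details may differ from `PCA`.

```python
import numpy as np

# Hypothetical toy data: two strongly correlated features (not the grains data).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
samples = np.column_stack([x, 2.0 * x + rng.normal(scale=0.3, size=200)])

# Shift: subtract the per-feature mean so the samples are centered at 0.
mean = samples.mean(axis=0)
centered = samples - mean

# Rotate: use the eigenvectors of the covariance matrix as the new axes.
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]     # sort by variance, descending
components = eigenvectors[:, order].T     # one row per principal component

# Project the centered samples onto the principal components.
pca_features = centered @ components.T

# Pearson correlation between the two columns, before and after.
corr_before = np.corrcoef(samples, rowvar=False)[0, 1]
corr_after = np.corrcoef(pca_features, rowvar=False)[0, 1]
print(round(corr_before, 3), round(corr_after, 3))
```

The correlation before the transformation should be close to 1, and the correlation after it should be zero up to floating-point error, while the array keeps the same number of rows and columns.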

```
#Perform the necessary imports
import matplotlib.pyplot as plt
from scipy.stats import pearsonr
#Assign the 0th column of grains: width
width = grains[:,0]
#Assign the 1st column of grains: length
length = grains[:,1]
#Scatter plot width vs length
plt.scatter(width, length)
plt.axis('equal')
plt.show()
#Calculate the Pearson correlation
correlation, pvalue = pearsonr(width,length)
#Display the correlation
print(correlation)
```

```
#Import PCA
from sklearn.decomposition import PCA
#Create PCA instance: model
model = PCA()
#Apply the fit_transform method of model to grains: pca_features
pca_features = model.fit_transform(grains)
#Assign 0th column of pca_features: xs
xs = pca_features[:,0]
#Assign 1st column of pca_features: ys
ys = pca_features[:,1]
#Scatter plot xs vs ys
plt.scatter(xs, ys)
plt.axis('equal')
plt.show()
#Calculate the Pearson correlation of xs and ys
correlation, pvalue = pearsonr(xs, ys)
#Display the correlation
print(correlation)
```

### Intrinsic dimension
* __Intrinsic dimension:__ the intrinsic dimension of a dataset is the number of features _needed_ to approximate the dataset. 
* Essential idea behind dimension reduction 
* What is the most compact representation of the samples?
    * Example: airplane location
    * you have a dataset of (longitude, latitude) coordinates for an airplane (2-D)
    * the dataset is intrinsically 1-D: the position can be described by a single number, the _displacement_ along the flight path, instead of two location coordinates.
* __PCA identifies intrinsic dimension when samples have any number of features.__
* __Intrinsic dimension = number of PCA features with significant variance.__
* PCA features are ordered by variance descending.
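
As a rough illustration of the airplane example, the sketch below builds made-up 2-D positions that lie almost on a straight line and counts the PCA features with significant variance. The data, the direction vector, and the 1% threshold are all assumptions chosen for the demonstration; there is no single correct threshold.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical flight path: positions along a straight line, plus small noise.
rng = np.random.default_rng(0)
distance = rng.uniform(0, 100, size=300)     # displacement along the route
direction = np.array([0.6, 0.8])             # made-up unit vector for the route
noise = rng.normal(scale=0.1, size=(300, 2))
samples = np.outer(distance, direction) + noise

pca = PCA()
pca.fit(samples)

# Variances of the PCA features, sorted from largest to smallest.
print(pca.explained_variance_)

# Count features holding more than 1% of the total variance (arbitrary cutoff).
intrinsic_dim = int(np.sum(pca.explained_variance_ratio_ > 0.01))
print(intrinsic_dim)
```

With data like this, only the first PCA feature carries noticeable variance, which matches the claim that the dataset is intrinsically 1-D.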

* Plotting the variances of PCA features:

```
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
pca = PCA()
pca.fit(samples)
```
* Now create a range enumerating the PCA features and make a bar plot of the variances
```
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_)
plt.xticks(features)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.show()
```
* intrinsic dimension can be ambiguous
* intrinsic dimension is an idealization
* there is not always one correct answer

* The first principal component of the data is the direction in which the data varies the most.

```
#Make a scatter plot of the untransformed points
plt.scatter(grains[:,0], grains[:,1])
#Create a PCA instance: model
model = PCA()
#Fit model to points
model.fit(grains)
#Get the mean of the grain samples: mean
mean = model.mean_
#Get the first principal component: first_pc
first_pc = model.components_[0,:]
#Plot first_pc as an arrow, starting at mean
plt.arrow(mean[0], mean[1], first_pc[0], first_pc[1], color='red', width=0.01)
#Keep axes on same scale
plt.axis('equal')
plt.show()
```

```
#Perform the necessary imports
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt
#Create scaler: scaler
scaler = StandardScaler()
#Create a PCA instance: pca
pca = PCA()
#Create pipeline: pipeline
pipeline = make_pipeline(scaler, pca)
#Fit the pipeline to 'samples'
pipeline.fit(samples)
#Plot the explained variances
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_)
plt.xlabel('PCA feature')
plt.ylabel('variance')
plt.xticks(features)
plt.show()
```

#### Dimension reduction with PCA
* Represents the same data using fewer features
* important part of machine-learning pipelines
* can be performed using PCA
* dimension reduction with PCA assumes the low-variance features are noise and the high-variance features are informative
* to use PCA for dimension reduction, you need to specify how many features to keep, e.g. `PCA(n_components=2)`

```
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(samples)
transformed = pca.transform(samples)
print(transformed.shape)
```
* PCA discards low variance features and assumes the high variance features are informative

#### Word frequency arrays
* word frequencies ("tf-idf")
* rows represent documents, columns represent words (from a fixed vocabulary)
* most entries of the word frequency array are 0
* Arrays like this, with a lot of zeros, are said to be "sparse" and are often represented using a special type of array called a csr matrix: `scipy.sparse.csr_matrix`
* `csr_matrix` saves space by only remembering the non-zero entries
* scikit-learn's `PCA` doesn't support `csr_matrix`
* Use sklearn's `TruncatedSVD` instead, which performs a similar transformation (it does not center the data first)

```
from sklearn.decomposition import TruncatedSVD
model = TruncatedSVD(n_components=3)
model.fit(documents) #documents is csr_matrix
transformed = model.transform(documents)
```
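
To make the space saving concrete, here is a small sketch with a made-up word-frequency array. The numbers are invented for illustration; only `csr_matrix` and its `nnz`, `data`, and `toarray()` members are taken from SciPy.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical word-frequency array: 3 documents, 6 vocabulary words.
dense = np.array([
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.0, 0.7, 0.0],
])

sparse = csr_matrix(dense)

print(dense.size)    # total number of entries in the dense array
print(sparse.nnz)    # number of entries actually stored
print(sparse.data)   # the stored non-zero values, row by row
```

Here only 5 of the 18 entries are stored. On a real vocabulary of thousands of words the gap is far larger.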

```
#Import PCA
from sklearn.decomposition import PCA
#Create a PCA model with 2 components: pca
pca = PCA(n_components=2)
#Fit the PCA instance to the scaled samples
pca.fit(scaled_samples)
#Transform the scaled samples: pca_features
pca_features = pca.transform(scaled_samples)
#Print the shape of pca_features
print(pca_features.shape)
```
#### tf-idf word frequency array
* `TfidfVectorizer` from sklearn; transforms a list of documents into a word frequency array, which it outputs as a csr_matrix. It has `fit()` and `transform()` methods like other sklearn objects.
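
A minimal sketch of `TfidfVectorizer` on a made-up three-document corpus follows. The documents are invented, and the column order is read from the fitted `vocabulary_` attribute rather than assumed.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, made up for illustration.
documents = ["cats say meow", "dogs say woof", "dogs chase cats"]

tfidf = TfidfVectorizer()
csr_mat = tfidf.fit_transform(documents)

print(issparse(csr_mat))           # the result is stored in a sparse format
print(csr_mat.shape)               # (number of documents, vocabulary size)
print(csr_mat.toarray().round(2))  # dense view, fine for a tiny example

# vocabulary_ maps each word to its column index in the array.
words = sorted(tfidf.vocabulary_, key=tfidf.vocabulary_.get)
print(words)
```

Each row is a document and each column a word, so the output can be passed directly to `TruncatedSVD` or `NMF`.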

```
#Perform the necessary imports
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
#Create a TruncatedSVD instance: svd
svd = TruncatedSVD(n_components=50)
#Create a KMeans instance: kmeans
kmeans = KMeans(n_clusters=6)
#Create a pipeline: pipeline
pipeline = make_pipeline(svd, kmeans)
```

```
#Import pandas
import pandas as pd
#Fit the pipeline to articles
pipeline.fit(articles)
#Calculate the cluster labels: labels
labels = pipeline.predict(articles)
#Create a DataFrame aligning labels and titles: df
df = pd.DataFrame({'label': labels, 'article': titles})
#Display df sorted by cluster label
print(df.sort_values('label'))
```



## Non-negative matrix factorization (NMF)
* __NMF:__ Non-negative matrix factorization
* NMF, like PCA, is a dimension reduction technique
* NMF models are interpretable (unlike PCA): they are easy to interpret and explain
* However, requires all sample features be non-negative (>=0)
* NMF achieves its interpretability by decomposing samples as sums of their parts
    * NMF expresses _documents_ as combinations of topics (or "themes")
    * NMF expresses _images_ as combinations of patterns.
* NMF available in `sklearn`
* Follows `fit()` / `transform()` (same as PCA)
* However, unlike PCA, the desired number of components must always be specified: `NMF(n_components=2)`
* NMF works with numpy arrays and sparse arrays in the `csr_matrix` format
* __tf-idf:__
    * __tf:__ frequency of word in the document. Example: if 10% of the words in Document1 are 'DataCamp', then the `tf` for DataCamp for that document is 0.1.
    * __idf:__ is a weighting scheme that reduces the influence of frequent words like "the"
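
The two ingredients can be worked out by hand on a toy corpus. This sketch assumes the plain textbook form `idf = log(N / df)`; scikit-learn's `TfidfVectorizer` uses a smoothed variant by default, so its numbers will differ. The documents and helper names (`tf`, `idf`) are made up for illustration.

```python
import math

# Hypothetical tokenized documents, made up for illustration.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "ate", "the", "bone"],
    ["the", "cat", "chased", "the", "dog"],
]

def tf(word, doc):
    # Fraction of the words in `doc` that are `word`.
    return doc.count(word) / len(doc)

def idf(word, docs):
    # Assumed textbook form: log(N / number of documents containing the word).
    # The word must occur in at least one document.
    containing = sum(1 for doc in docs if word in doc)
    return math.log(len(docs) / containing)

# Print tf (within the second document) and idf for a few words.
for word in ["the", "cat", "bone"]:
    print(word, round(tf(word, docs[1]), 3), round(idf(word, docs), 3))
```

"the" appears in every document, so its idf is 0 and it contributes nothing, while a rare word like "bone" gets the largest weight.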
    
```
from sklearn.decomposition import NMF
model = NMF(n_components=2)
model.fit(samples)
nmf_features = model.transform(samples)
print(model.components_)
print(nmf_features)
```
* Just like PCA has components, NMF has components which it learns from the samples
* As with PCA, the dimension of the components of NMF is the same as the dimension of samples
* Entries of the NMF component are _always_ non-negative
* NMF feature values are _always_ non-negative as well
* The features and the components can be used to (approximately) reconstruct the original data set.

__Reconstruction of a sample:__

```
print(samples[i,:])
print(nmf_features[i,:])
```
* If we multiply each NMF component (each row of `model.components_`) by the corresponding NMF feature value of the sample and add up each column, we get something very close to the original sample.
* In short, a sample can be reconstructed by multiplying components by feature values and adding them up.
* This can also be expressed as a product of matrices (this is the **M**atrix **F**actorization of **NMF**)
* Again, **NMF fits to non-negative data only** (such as word frequency arrays, images encoded as arrays, arrays encoding audio spectrograms, and arrays representing purchase histories on e-commerce sites).
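
The reconstruction arithmetic can be checked with a few made-up numbers. The components and feature values below are invented for illustration (they are not the output of a fitted model); the point is only that "multiply and add up" is the same operation as a matrix product.

```python
import numpy as np

# Made-up NMF components: 2 components over 3 original features.
components = np.array([
    [1.0, 0.5, 0.0],
    [0.2, 0.1, 2.1],
])
# Made-up NMF feature values for one sample.
sample_features = np.array([2.0, 1.0])

# Multiply each component by its feature value, then add up each column.
by_hand = sample_features[0] * components[0] + sample_features[1] * components[1]

# The same reconstruction written as a matrix product.
as_product = sample_features @ components

print(by_hand)
print(as_product)
```

For a whole fitted model, the analogous product would be `nmf_features @ model.components_`, which approximates `samples`.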

```
#Import NMF
from sklearn.decomposition import NMF
#Create an NMF instance: model
model = NMF(n_components=6)
#Fit the model to articles
model.fit(articles)
#Transform the articles: nmf_features
nmf_features = model.transform(articles)
#Print the NMF features
print(nmf_features.round(2))
```

```
#Import pandas
import pandas as pd
#Create a pandas DataFrame: df
df = pd.DataFrame(nmf_features, index=titles)
#Print the row for 'Anne Hathaway'
print(df.loc['Anne Hathaway'])
#Print the row for 'Denzel Washington'
print(df.loc['Denzel Washington'])
```

* Apply NMF model to articles:

```
print(articles.shape) # =(20000,800)
from sklearn.decomposition import NMF
nmf = NMF(n_components=10)
nmf.fit(articles)
print(nmf.components_.shape) # =(10,800): 10 components defined above, 800 words chosen from original dataset
```
* The rows of `nmf.components_` (the components) live in an 800-dimensional space (one dimension for each of the words)
* Choosing a component and looking at which words have the highest values, we see that they fit a theme. Top words:
    * species
    * plant
    * plants
    * genetic
    * evolution
    * life
* **So, if NMF is applied to documents, then the components correspond to topics.** And the NMF features reconstruct the documents from the topics
* **For documents:**
    * NMF components represent topics
    * NMF features combine topics into documents
* **For images:**
    * NMF components are parts of images (patterns that frequently occur in the images)
    
**Representing a collection of images as a non-negative array:**
* Grayscale image = no colors, only shades of gray ranging from black to white
* Since there are only shades of gray (and no colors), a grayscale image can be encoded with __measures of pixel brightness__ 
* **Pixel brightness** is represented with a value between 0 and 1 (0 is black, 1 is white)
* Then the image can be represented as a 2-D array
* These 2-D arrays of numbers can then be flattened by __enumerating the entries.__ For instance, we could read off the entries row by row, from left to right, top to bottom.
* Thus, a collection of images of the same size can be encoded as a 2-D array in which each row corresponds to an image and each column represents a pixel
* Viewing the images as samples, and the pixels as features, we see that the data is arranged similarly to the word frequency array
* Since all entries are non-negative, NMF can be used to learn the parts of images
* to recover the image: 

```
bitmap = sample.reshape((2,3))
print(bitmap)
```
* This yields a 2-D array of pixel brightness measurements
* To display the corresponding image:

```
from matplotlib import pyplot as plt
plt.imshow(bitmap, cmap= 'gray', interpolation= 'nearest')
plt.show()
```
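
The flattening step described above can be sketched as follows, using two made-up 2x3 images. The pixel values are invented; the sketch only shows that stacking flattened images gives one row per image and that `reshape` undoes the flattening.

```python
import numpy as np

# Two hypothetical 2x3 grayscale images (0 is black, 1 is white).
image_a = np.array([[0.0, 1.0, 0.5],
                    [1.0, 0.0, 1.0]])
image_b = np.array([[1.0, 1.0, 0.0],
                    [0.0, 0.5, 0.5]])

# Flatten each image row by row and stack them: one row per image.
samples = np.array([image_a.flatten(), image_b.flatten()])
print(samples.shape)

# Reshaping a row recovers the original image.
recovered = samples[0].reshape((2, 3))
print(np.array_equal(recovered, image_a))
```

Because every entry is a brightness between 0 and 1, an array built this way satisfies NMF's non-negativity requirement.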

```
#Import pandas
import pandas as pd
#Create a DataFrame: components_df
components_df = pd.DataFrame(model.components_, columns=words)
#Print the shape of the DataFrame
print(components_df.shape)
#Select row 3: component
component = components_df.iloc[3,:]
#Print result of nlargest
print(component.nlargest())
```

```
#Import pyplot
from matplotlib import pyplot as plt
#Select the 0th row: digit
digit = samples[0,:]
#Print digit
print(digit)
#Reshape digit to a 13x8 array: bitmap
bitmap = digit.reshape(13,8)
#Print bitmap
print(bitmap)
#Use plt.imshow to display bitmap
plt.imshow(bitmap, cmap='gray', interpolation='nearest')
plt.colorbar()
plt.show()
```
* The NMF model below uses 7 components because 7 is the number of cells in an LED digit display

```
#Import NMF
from sklearn.decomposition import NMF
#Create an NMF model: model
model = NMF(n_components=7)
#Apply fit_transform to samples: features
features = model.fit_transform(samples)
#Call show_as_image on each component
for component in model.components_:
    show_as_image(component)
#Assign the 0th row of features: digit_features
digit_features = features[0,:]
#Print digit_features
print(digit_features)
```

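**Function to display an image version of any 1-D array:**

```
from matplotlib import pyplot as plt

def show_as_image(sample):
    # Reshape the flat array into a 13x8 bitmap and show it in grayscale
    bitmap = sample.reshape((13, 8))
    plt.figure()
    plt.imshow(bitmap, cmap='gray', interpolation='nearest')
    plt.colorbar()
    plt.show()
```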

### Other:

* Unlike NMF, PCA doesn't learn the parts of things. 
* Its components do not correspond to topics (in the case of documents) or to parts of images, when trained on images.

```
#Import PCA
from sklearn.decomposition import PCA
#Create a PCA instance: model
model = PCA(n_components=7)
#Apply fit_transform to samples: features
features = model.fit_transform(samples)
#Call show_as_image on each component
for component in model.components_:
    show_as_image(component)
```

## Building recommender systems using NMF
* Task: finding similar articles
* Similar articles should have similar topics and similar NMF features
* Strategy:
    * Apply NMF to the word-frequency array of the articles 
    * Compare the articles using the resulting NMF features
    
```
from sklearn.decomposition import NMF
nmf = NMF(n_components =6)
nmf_features = nmf.fit_transform(articles)
```
* Now we've got NMF features for every article: each row of the new array holds the features of one article
* Now we need to define how to compare the articles, using their NMF features
* Different versions of the same document have same/similar topic proportions, but it isn't always the case that the NMF feature values are exactly the same
    * For instance, one version of a document may use very direct language, whereas another might interweave content with "meaningless chatter"
    * "Meaningless chatter" reduces the frequency of the topic words overall, which reduces the values of the NMF features representing the topics. 
    * However, **on a scatterplot of the NMF features, all these versions lie on a single line passing through the origin.**
    * For this reason, when comparing two documents, it's a good idea to compare these lines.
* **Cosine similarity:** uses the angle between the lines; higher values mean more similarity; maximum value is 1, when angle is 0 degrees.
* **Calculating the cosine similarity:**

```
from sklearn.preprocessing import normalize
norm_features = normalize(nmf_features)
#now select row corresponding to current article; if has index 23:
current_article = norm_features[23,:] 
similarities = norm_features.dot(current_article)
print(similarities)
```
* This results in the cosine similarities
* With the help of a pandas dataframe, we can label the similarities with the article titles
* titles given as a list: `titles`

```
import pandas as pd
norm_features = normalize(nmf_features)
df = pd.DataFrame(norm_features, index = titles)
current_article = df.loc['Dog bites man']
similarities = df.dot(current_article)
print(similarities.nlargest())
```
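
To see why cosine similarity is unaffected by "meaningless chatter", the sketch below compares made-up NMF feature vectors. The vectors and the helper `cosine_similarity` are hypothetical; the helper computes the same quantity as normalizing both vectors and taking their dot product.

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between vectors a and b.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up NMF features for a direct document and a "chattier" version of it.
direct = np.array([0.6, 0.0, 0.3])
chatty = 0.5 * direct                # same topic proportions, smaller values
other = np.array([0.0, 0.7, 0.1])    # a document about a different topic

print(round(cosine_similarity(direct, chatty), 3))
print(round(cosine_similarity(direct, other), 3))
```

The two versions of the same document lie on one line through the origin, so their similarity is 1 even though their feature values differ, while the unrelated document scores close to 0.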

```
#Perform the necessary imports
import pandas as pd
from sklearn.preprocessing import normalize
#Normalize the NMF features: norm_features
norm_features = normalize(nmf_features)
#Create a DataFrame: df
df = pd.DataFrame(norm_features, index=titles)
#Select the row corresponding to 'Cristiano Ronaldo': article
article = df.loc['Cristiano Ronaldo']
#Compute the dot products: similarities
similarities = df.dot(article)
#Display those with the largest cosine similarity
print(similarities.nlargest())
```

```
#Perform the necessary imports
from sklearn.decomposition import NMF
from sklearn.preprocessing import Normalizer, MaxAbsScaler
from sklearn.pipeline import make_pipeline
#Create a MaxAbsScaler: scaler
scaler = MaxAbsScaler()
#Create an NMF model: nmf
nmf = NMF(n_components=20)
#Create a Normalizer: normalizer
normalizer = Normalizer()
#Create a pipeline: pipeline
pipeline = make_pipeline(scaler, nmf, normalizer)
#Apply fit_transform to artists: norm_features
norm_features = pipeline.fit_transform(artists)
```

```
#Import pandas
import pandas as pd
#Create a DataFrame: df
df = pd.DataFrame(norm_features, index=artist_names)
#Select row of 'Bruce Springsteen': artist
artist = df.loc['Bruce Springsteen']
#Compute cosine similarities: similarities
similarities = df.dot(artist)
#Display those with highest cosine similarity
print(similarities.nlargest())
```