## Cosine Similarity Calculations
Cosine similarity measures how similar two non-zero vectors of an inner product space are by taking the cosine of the angle between them. Similarity measures have a multitude of uses in machine learning projects; they come in handy when matching strings, measuring distance, and extracting features. Cosine similarity depends on the orientation of the vectors rather than on their magnitude.
In this case study, you'll use cosine similarity on both a numeric dataset of points in a plane and a text dataset for string matching.
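
To make the "orientation, not magnitude" point concrete, here is a minimal sketch of the formula, cos(θ) = (a · b) / (‖a‖ ‖b‖), in plain Python. The helper name `cosine_sim` and the example vectors are assumptions made for illustration; they are not part of the case study.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between two non-zero vectors (hypothetical helper)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative vectors only; none of these come from the dataset.
same_direction = cosine_sim([1, 2], [10, 20])  # same orientation, 10x the length
orthogonal = cosine_sim([1, 0], [0, 1])        # vectors at a right angle
opposite = cosine_sim([1, 2], [-1, -2])        # opposite orientation

print(same_direction, orthogonal, opposite)
```

The three values should come out close to 1, 0, and -1: scaling a vector does not change its similarity to another vector, while rotating it does.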

Load the Python modules, including `cosine_similarity` from `sklearn.metrics.pairwise`.

In [31]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.cm as cm
%matplotlib inline
plt.style.use('ggplot')
from scipy import spatial
from sklearn.metrics.pairwise import cosine_similarity

**<font color='teal'> Load the distance dataset into a dataframe. </font>**

In [32]:
df=pd.read_csv(r'C:\Users\zsoltani\Desktop\Zohreh Training and Personal files\Python class\Chapter 14\1585686145_CosineSimilarityCaseStudy\\distance_dataset (1).csv')

In [33]:
df.head()

   Unnamed: 0         X         Y         Z  ClusterID
0           0  5.135779  4.167542  5.787635          4
1           1  4.280721  5.770909  6.091044          4
2           2  8.329098  7.540436  3.247239          2
3           3  5.470224  5.069249  5.768313          4
4           4  2.381797  2.402374  3.879101          1

In [34]:
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  2000 non-null   int64  
 1   X           2000 non-null   float64
 2   Y           2000 non-null   float64
 3   Z           2000 non-null   float64
 4   ClusterID   2000 non-null   int64  
dtypes: float64(3), int64(2)
memory usage: 78.2 KB


### Cosine Similarity with clusters and numeric matrices

All points in our dataset can be thought of as feature vectors. Here we compute the __Cosine Similarity__ between each feature vector in the YZ plane and the [5, 5] vector we chose as a reference. The sklearn.metrics.pairwise module provides __cosine_similarity__, an efficient way to compute these similarities for large arrays.

 **<font color='teal'> First, create a 2D and a 3D matrix from the dataframe. The 2D matrix should contain the 'Y' and 'Z' columns and the 3D matrix should contain the 'X','Y', and 'Z' columns.</font>**

In [46]:
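# X is already float64 (see df.info() above); the cast is only a safeguard
df['X']=df['X'].astype(float)
print(df['X'])
# 3D matrix: the X, Y, and Z columns
mat=np.array(df[['X','Y','Z']])
# 2D matrix: the Y and Z columns
matYZ=np.array(df[['Y','Z']])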

0       5.135779
1       4.280721
2       8.329098
3       5.470224
4       2.381797
          ...   
1995    4.616245
1996    4.753185
1997    2.000186
1998    4.735917
1999    4.955436
Name: X, Length: 2000, dtype: float64


Calculate the cosine similarity between those matrices and the reference vectors [5, 5] and [5, 5, 5]. Then subtract those measures from 1 and store the results (the cosine distances) as new features.

In [48]:
type(mat)

numpy.ndarray
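
The step described above could be sketched as follows. This is a minimal illustration, not the case study's solution: the `sample` dataframe holds made-up points standing in for the real dataset, and the column names `cos_dist_3D` and `cos_dist_YZ` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical stand-in for the case-study dataframe (made-up points, not the real data).
sample = pd.DataFrame({
    'X': [5.0, 1.0, 8.0],
    'Y': [5.0, 4.0, 2.0],
    'Z': [5.0, 9.0, 3.0],
})

mat = sample[['X', 'Y', 'Z']].to_numpy()
matYZ = sample[['Y', 'Z']].to_numpy()

# cosine_similarity expects 2D inputs, so each reference vector is a 1-row matrix.
ref3D = np.array([[5, 5, 5]])
ref2D = np.array([[5, 5]])

# One similarity per row of the data matrix; ravel() flattens the (n, 1) result.
simCosine3D = cosine_similarity(mat, ref3D).ravel()
simCosine = cosine_similarity(matYZ, ref2D).ravel()

# Cosine distance is 1 minus cosine similarity (hypothetical column names).
sample['cos_dist_3D'] = 1.0 - simCosine3D
sample['cos_dist_YZ'] = 1.0 - simCosine

print(sample)
```

The first made-up point, [5, 5, 5], points in the same direction as both reference vectors, so its distances should be essentially 0. On the real data, `sample` would be replaced by `df`.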

----

### Cosine Similarity with text data
This is a quick example of how you can use Cosine Similarity to compare different text values or names for record matching or other natural language processing needs.
First, we use a count vectorizer to represent Document 0 and Document 1 as vectors of word counts, with one column for each unique word in the corpus.

In [53]:
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
Document1 = "Starbucks Coffee"
Document2 = "Essence of Coffee"

corpus = [Document1,Document2]

X_train_counts = count_vect.fit_transform(corpus)

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(X_train_counts.toarray(),columns=count_vect.get_feature_names_out(),index=['Document 0','Document 1'])

| | coffee | essence | of | starbucks |
|---|---|---|---|---|
| Document 0 | 1 | 0 | 0 | 1 |
| Document 1 | 1 | 1 | 1 | 0 |


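Now, we use TF-IDF (term frequency-inverse document frequency), a common weighting scheme, to rescale the counts so that words appearing in fewer documents receive more weight than words shared across the corpus.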

In [54]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
trsfm=vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(trsfm.toarray(),columns=vectorizer.get_feature_names_out(),index=['Document 0','Document 1'])

| | coffee | essence | of | starbucks |
|---|---|---|---|---|
| Document 0 | 0.579739 | 0.0 | 0.0 | 0.814802 |
| Document 1 | 0.449436 | 0.631667 | 0.631667 | 0.0 |
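
To see where these weights come from, here is a minimal sketch that rebuilds them by hand from the raw counts. It assumes the `TfidfVectorizer` defaults (`smooth_idf=True`, `norm='l2'`, no sublinear scaling); the variable names are chosen for illustration only.

```python
import math

# Raw term counts from the count-vectorizer output above
# (column order: coffee, essence, of, starbucks).
counts = [
    [1, 0, 0, 1],  # Document 0: "Starbucks Coffee"
    [1, 1, 1, 0],  # Document 1: "Essence of Coffee"
]
n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: how many documents contain each term.
doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])  # scale each row to unit L2 length

for row in tfidf:
    print([round(v, 6) for v in row])
```

Because "coffee" appears in both documents, its IDF is the minimum value of 1, so after normalization it carries less weight than "starbucks", which appears in only one document.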


Finally, we apply the __Cosine Similarity__ measure to calculate how similar Document 0 is to every document in the corpus, itself included. The first value, 1, shows that Document 0 is identical to itself, and 0.26055567 is the similarity between Document 0 and Document 1.

In [55]:
cosine_similarity(trsfm[0:1], trsfm)

array([[1.        , 0.26055567]])
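
As a quick sanity check, the second value can be reproduced from the TF-IDF table. The sketch below assumes each TF-IDF row already has unit length (the `TfidfVectorizer` default, `norm='l2'`), in which case cosine similarity reduces to a plain dot product. The row values are copied from the rounded table, so the result agrees only to about six decimal places.

```python
# TF-IDF rows copied from the table above (rounded to six decimals);
# column order: coffee, essence, of, starbucks.
doc0 = [0.579739, 0.0, 0.0, 0.814802]
doc1 = [0.449436, 0.631667, 0.631667, 0.0]

# For unit-length vectors the cosine similarity is just the dot product.
similarity = sum(a * b for a, b in zip(doc0, doc1))

print(round(similarity, 4))
```

Only the "coffee" column is non-zero in both rows, so the similarity is entirely due to that one shared word.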

Replace the current values for `Document 0` and `Document 1` with your own sentence or paragraph and apply the same steps as we did in the above example.

 **<font color='teal'> Combine the documents into a corpus.</font>**

In [63]:
Document3 = "Amazon website"
Document4 = "Amazon forest"

corpus2 = [Document3,Document4]

 **<font color='teal'> Apply the count vectorizer to the corpus to transform it into vectors.</font>**

In [64]:
X_train_counts = count_vect.fit_transform(corpus2)



 **<font color='teal'> Convert the vector counts to a dataframe with Pandas.</font>**

In [65]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(X_train_counts.toarray(),columns=count_vect.get_feature_names_out(),index=['Document 0','Document 1'])

| | amazon | forest | website |
|---|---|---|---|
| Document 0 | 1 | 0 | 1 |
| Document 1 | 1 | 1 | 0 |


 **<font color='teal'> Apply TF-IDF to convert the vectors to unique frequency measures.</font>**

In [66]:
vectorizer = TfidfVectorizer()
trsfm=vectorizer.fit_transform(corpus2)

 **<font color='teal'> Use the cosine similarity function to get measures of similarity for the sentences or paragraphs in your original document.</font>**

In [67]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
pd.DataFrame(trsfm.toarray(),columns=vectorizer.get_feature_names_out(),index=['Document 0','Document 1'])

| | amazon | forest | website |
|---|---|---|---|
| Document 0 | 0.579739 | 0.0 | 0.814802 |
| Document 1 | 0.579739 | 0.814802 | 0.0 |
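
The cell above displays the TF-IDF weights but stops short of the similarity itself. A minimal sketch of the remaining step, mirroring the call used for the first corpus, could look like this; the variable name `similarities` is an assumption, and the corpus is rebuilt here only so that the block runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Same two documents as corpus2 above, repeated so the block is self-contained.
corpus2 = ["Amazon website", "Amazon forest"]

vectorizer = TfidfVectorizer()
trsfm = vectorizer.fit_transform(corpus2)

# Compare Document 0 against every document in the corpus, itself included.
similarities = cosine_similarity(trsfm[0:1], trsfm)

print(similarities)
```

The first entry should be 1 (Document 0 compared with itself) and the second roughly 0.336, which is the product of the two "amazon" weights in the table, since "amazon" is the only word the documents share.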
