# Basic Text Representation
The Twitter dataset (`tweets.csv`) was collected in February 2015. Contributors first classified each tweet as positive, negative, or neutral, and then categorized the reason behind each negative tweet (such as "late flight" or "rude service"). The dataset is available [on Kaggle](https://www.kaggle.com/crowdflower/twitter-airline-sentiment).

You should build an NLP pipeline to find all tweets that are related to `bad catering service`. In particular, you should do the following:
- Load the `tweets` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Train a text representation model, such as the [bag of n-gram vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) or [TF-IDF vectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), on the content of the tweets.
- Apply the trained text representation model to vectorize the query (i.e., `bad catering service`) and all documents (i.e., tweets).
- Calculate the similarity of each vectorized tweet to the vectorized query using a similarity measure, such as [cosine similarity](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html).
- Rank the tweets based on the similarity of their vectors to the query vector.
- Check the documentation to identify the most important hyperparameters, attributes, and methods of the model. Use them in practice.
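
The steps above can be sketched end to end in plain Python, without scikit-learn, to show what each stage computes. This is only a minimal illustration under stated assumptions: the three-tweet corpus is made up, the helper names (`tokenize`, `bag_of_words`, `cosine`) are hypothetical, and only unigram counts are used.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words(text):
    """Represent a text as a mapping from token to count."""
    return Counter(tokenize(text))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up tweets, for illustration only.
tweets = [
    "the catering on this flight was bad",
    "great service and friendly crew",
    "my bag is lost again",
]

query = bag_of_words("bad catering service")
scores = [cosine(query, bag_of_words(t)) for t in tweets]

# Rank tweet indices from most to least similar to the query.
ranking = sorted(range(len(tweets)), key=lambda i: scores[i], reverse=True)
for rank, i in enumerate(ranking, start=1):
    print(f"#{rank}: {tweets[i]} (score: {scores[i]:.3f})")
```

The first tweet shares two terms with the query, the second shares one, and the third shares none, so its score is zero.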

# **Import libraries**

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# **Load data**

In [None]:
df = pd.read_csv('/content/tweets.csv')
df

Unnamed: 0,tweet_id,airline_sentiment,airline_sentiment_confidence,negativereason,negativereason_confidence,airline,airline_sentiment_gold,name,negativereason_gold,retweet_count,text,tweet_coord,tweet_created,tweet_location,user_timezone
0,570306133677760513,neutral,1.0000,,,Virgin America,,cairdin,,0,@VirginAmerica What @dhepburn said.,,2015-02-24 11:35:52 -0800,,Eastern Time (US & Canada)
1,570301130888122368,positive,0.3486,,0.0000,Virgin America,,jnardino,,0,@VirginAmerica plus you've added commercials t...,,2015-02-24 11:15:59 -0800,,Pacific Time (US & Canada)
2,570301083672813571,neutral,0.6837,,,Virgin America,,yvonnalynn,,0,@VirginAmerica I didn't today... Must mean I n...,,2015-02-24 11:15:48 -0800,Lets Play,Central Time (US & Canada)
3,570301031407624196,negative,1.0000,Bad Flight,0.7033,Virgin America,,jnardino,,0,@VirginAmerica it's really aggressive to blast...,,2015-02-24 11:15:36 -0800,,Pacific Time (US & Canada)
4,570300817074462722,negative,1.0000,Can't Tell,1.0000,Virgin America,,jnardino,,0,@VirginAmerica and it's a really big bad thing...,,2015-02-24 11:14:45 -0800,,Pacific Time (US & Canada)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
14635,569587686496825344,positive,0.3487,,0.0000,American,,KristenReenders,,0,@AmericanAir thank you we got on a different f...,,2015-02-22 12:01:01 -0800,,
14636,569587371693355008,negative,1.0000,Customer Service Issue,1.0000,American,,itsropes,,0,@AmericanAir leaving over 20 minutes Late Flig...,,2015-02-22 11:59:46 -0800,Texas,
14637,569587242672398336,neutral,1.0000,,,American,,sanyabun,,0,@AmericanAir Please bring American Airlines to...,,2015-02-22 11:59:15 -0800,"Nigeria,lagos",
14638,569587188687634433,negative,1.0000,Customer Service Issue,0.6659,American,,SraJackson,,0,"@AmericanAir you have my money, you change my ...",,2015-02-22 11:59:02 -0800,New Jersey,Eastern Time (US & Canada)


# **Text Representation**
**Bag of n-grams**

In [None]:
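# Count unigrams, bigrams, and trigrams.
vectorizer = CountVectorizer(ngram_range=(1, 3))
# Learn the vocabulary from the tweets, then vectorize them.
vectorizer.fit(df['text'])
documents = vectorizer.transform(df['text'])
documents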

<14640x293223 sparse matrix of type '<class 'numpy.int64'>'
	with 684985 stored elements in Compressed Sparse Row format>
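
The large number of columns follows from `ngram_range=(1, 3)`: every distinct unigram, bigram, and trigram seen during fitting becomes its own feature. The sketch below shows how word n-grams can be enumerated for one short text. The helper `word_ngrams` is hypothetical and is not scikit-learn's implementation.

```python
def word_ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous word n-grams with n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

grams = word_ngrams("bad catering service".split())
print(grams)
# ['bad', 'catering', 'service', 'bad catering', 'catering service', 'bad catering service']
```

A three-word text already yields six features, which is why the vocabulary grows so quickly on thousands of tweets.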

In [None]:
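# Apply the fitted vectorizer to the query.
# Note: the task asks for 'bad catering service'; the outputs below were produced with
# the string as written here, where the misspelled 'catring' is likely out of vocabulary.
query = vectorizer.transform(['the catring service'])
query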

<1x293223 sparse matrix of type '<class 'numpy.int64'>'
	with 2 stored elements in Compressed Sparse Row format>
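
A fitted vectorizer only counts terms that are already in its vocabulary, so a misspelled query word contributes nothing to the vector. That is a plausible explanation for the query having only 2 stored elements. The sketch below illustrates the effect with a unigram-only vocabulary. The helpers `fit_vocabulary` and `transform` and the two training sentences are assumptions made for this example.

```python
from collections import Counter

def fit_vocabulary(docs):
    """Map each distinct token seen during fitting to a column index."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab

def transform(text, vocab):
    """Count only the tokens that exist in the fitted vocabulary."""
    counts = Counter(t for t in text.lower().split() if t in vocab)
    return {vocab[t]: c for t, c in counts.items()}

# Made-up training sentences, for illustration only.
vocab = fit_vocabulary(["the service was slow", "the catering was cold"])

good = transform("the catering service", vocab)
typo = transform("the catring service", vocab)
print(len(good), len(typo))  # 3 2
```

With the typo, only `the` and `service` survive, so the ranking is driven by those two words alone.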

**The similarity**

In [None]:
# `documents` (the vectorized tweets) and `query` (the vectorized query) come from the same
# fitted vectorizer, so they share one feature space and have the same number of columns.

# Cosine similarity between the query and every tweet.
similarities = cosine_similarity(query, documents)

# `similarities` has shape (1, num_tweets); element [0, j] is the similarity between the query and tweet j.

# Index of the single most similar tweet.
most_similar_tweet_index = similarities.argmax()

# Indices of the top k tweets: argsort is ascending, so take the last k and reverse them.
k = 10
top_k_tweet_indices = similarities.argsort()[0][-k:][::-1]
top_k_tweet_indices

array([ 9929,  1567,  6970, 10614,  6387, 13925,  7012, 10345,   141,
        4641])
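
The top-k selection can also be written in plain Python, which makes the logic easy to check on a small list. In this sketch the `top_k` helper and the scores are hypothetical. Tie order may differ from the NumPy version.

```python
def top_k(scores, k):
    """Return the indices of the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]

# Made-up similarity scores, for illustration only.
scores = [0.10, 0.00, 0.40, 0.25, 0.00]

best = top_k(scores, 3)
print(best)  # [2, 3, 0]

# Optionally keep only tweets that share at least one term with the query.
matched = [i for i in top_k(scores, len(scores)) if scores[i] > 0]
print(matched)  # [2, 3, 0]
```

Filtering out zero scores can be useful when k is larger than the number of tweets that actually overlap with the query.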

**Top Rank**

In [None]:
# sort the similarity scores in descending order and retrieve the corresponding tweet indices
sorted_indices = np.argsort(similarities[0])[::-1]

# retrieve the text of the top k tweets
k = 10
top_k_tweets = df.iloc[sorted_indices[:k]]['text'].tolist()

# print the top k tweets and their similarity scores
for i, tweet in enumerate(top_k_tweets):
    similarity_score = similarities[0][sorted_indices[i]]
    print(f"Tweet #{i+1}: {tweet} (Similarity score: {similarity_score})")


Tweet #1: @USAirways thanks for the worst customer service on the face of the earth. I loved the 5+ hrs on hold along w 2 Cancelled Flightled flights (Similarity score: 0.40291148201269006)
Tweet #2: @united might possibly have the worst service on the planet. (Similarity score: 0.39391929857916763)
Tweet #3: @JetBlue We had 2 great flights into and out of the Bahamas, even during the bad weather in the northeast, thanks for the great service!!! (Similarity score: 0.3880752628531664)
Tweet #4: @USAirways Sitting on the runway at phl for the last 30 min because the correct weights for the flight aren't in the system? #jobfail (Similarity score: 0.3666177875533832)
Tweet #5: @SouthwestAir has the best customer service! (Similarity score: 0.36514837167011066)
Tweet #6: @AmericanAir The delay is nothing but the personnel being so combative up to the point of saying "what's the hury,  the plane is not leaving (Similarity score: 0.3646624787447363)
Tweet #7: @JetBlue loved the service from t