<font color='grey'>
  
### Hugging Face Transformers
   
</font>

<font color='grey'>
    
>Let’s have a quick look at the features of the 🤗 Transformers library. The library downloads pretrained models for Natural Language Understanding (NLU) tasks, such as analyzing the sentiment of a text, and for Natural Language Generation (NLG) tasks, such as completing a prompt with new text or translating it into another language.

>First we will see how to use the pipeline API to run those pretrained models at inference. Then we will dig a little deeper and see how the library gives you access to those models and helps you preprocess your data.

#### Getting started on a task with a pipeline

>The easiest way to use a pretrained model on a given task is to use <u>pipeline()</u>. 🤗 Transformers provides the following tasks out of the box:

>* Sentiment analysis: is a text positive or negative?
>* Text generation (in English): provide a prompt and the model will generate what follows.
>* Named entity recognition (NER): in an input sentence, label each word with the entity it represents (person, place, etc.).
>* Question answering: provide the model with some context and a question, and extract the answer from the context.
>* Filling masked text: given a text with masked words (e.g., replaced by [MASK]), fill in the blanks.
>* Summarization: generate a summary of a long text.
>* Translation: translate a text into another language.
>* Feature extraction: return a tensor representation of the text.
    
</font>

In [29]:
pip install transformers

Note: you may need to restart the kernel to use updated packages.


In [30]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')

<font color='grey'>

> The first time this command runs, a pretrained model and its tokenizer are downloaded and cached. We will look at both later on, but as an introduction: the tokenizer’s job is to preprocess the text for the model, which is then responsible for making predictions. The pipeline groups all of that together and post-processes the predictions to make them readable. For instance:
    
</font>

In [32]:
classifier('i dont kmow how i feel')

[{'label': 'NEGATIVE', 'score': 0.9682340621948242}]
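
The return value is a list with one dictionary per input text, each holding a `label` and a `score`. A minimal sketch of unpacking that structure, using the output shown above as a literal so that it runs without downloading a model:

```python
# Output copied from the cell above: one dict per input text.
output = [{'label': 'NEGATIVE', 'score': 0.9682340621948242}]

prediction = output[0]        # a single input produced a single prediction
label = prediction['label']   # 'NEGATIVE' or 'POSITIVE' for this checkpoint
score = prediction['score']   # score the model assigns to that label

print(f"{label} ({score:.1%})")  # NEGATIVE (96.8%)
```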

<font color='grey'>

>By default, the model downloaded for this pipeline is called “distilbert-base-uncased-finetuned-sst-2-english”.  It uses the DistilBERT architecture and has been fine-tuned on a dataset called SST-2 for the sentiment analysis task.

#### Downloaded Pretrained Model Card:

>**DistilBERT base uncased finetuned SST-2**

>This model is a fine-tuned checkpoint of DistilBERT-base-uncased, trained on SST-2. It reaches an accuracy of 91.3 on the dev set (for comparison, the BERT bert-base-uncased version reaches an accuracy of 92.7).

>**Fine-tuning hyper-parameters**
>* learning_rate = 1e-5
>* batch_size = 32
>* warmup = 600
>* max_seq_length = 128
>* num_train_epochs = 3.0
</font>
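
Because the default checkpoint for a task can change between library releases, it may be worth naming the model explicitly. A minimal sketch, assuming the checkpoint name quoted above still resolves on the model hub; `build_classifier` is a hypothetical helper name, not part of the library:

```python
# Checkpoint name taken from the text above.
MODEL_NAME = "distilbert-base-uncased-finetuned-sst-2-english"

def build_classifier(model_name: str = MODEL_NAME):
    """Return a sentiment-analysis pipeline pinned to `model_name`.

    The import is deferred so the function can be defined before
    `transformers` is installed.
    """
    from transformers import pipeline
    return pipeline("sentiment-analysis", model=model_name)
```

Calling `build_classifier()` downloads and caches the model on first use, just like the unpinned call.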

<font color='grey'>
    
### Load the target data from GitHub

</font>

In [33]:
#url='https://raw.githubusercontent.com/drainganggtb/FinalProject/main/clean_tweets.csv'
#url='https://raw.githubusercontent.com/drainganggtb/FinalProject/main/new_df.csv'
url='https://raw.githubusercontent.com/drainganggtb/FinalProject/main/new_df_vaccine.csv'

In [34]:
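# Load the second CSV column (the tweet text) into a one-column DataFrame
import pandas as pd
statement_df = pd.read_csv(url, encoding='UTF-8', header=None, sep=',',
                           on_bad_lines='warn',  # skip malformed rows with a warning (pandas >= 1.3)
                           usecols=[1], lineterminator='\n').drop(index=0)  # drop row 0
statement_df.reset_index(drop=True, inplace=True)
statement_df.columns = ['statement']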

In [35]:
statement_df

Unnamed: 0,statement
0,RT @___inCANdescent: Y’all are worried about t...
1,@Hellharbour Feel the same way. Makes my blood...
2,RT @___inCANdescent: Y’all are worried about t...
3,RT @GeorgeTakei: Trump called early virus warn...
4,RT @___inCANdescent: Y’all are worried about t...
...,...
50117,@lizzielevy Yes. There is a deep suspicion on ...
50118,@LokayFOX5 @LarryHogan how am I supposed to ge...
50119,RT @Sky_Lee_1: @mkraju Sit the f down on this ...
50120,"RT @LoudmouthLira: My friend, puffing on her 3..."


#### Drop rows with missing values, if any

In [36]:
statement_df = statement_df.dropna()
statement_df

Unnamed: 0,statement
0,RT @___inCANdescent: Y’all are worried about t...
1,@Hellharbour Feel the same way. Makes my blood...
2,RT @___inCANdescent: Y’all are worried about t...
3,RT @GeorgeTakei: Trump called early virus warn...
4,RT @___inCANdescent: Y’all are worried about t...
...,...
50117,@lizzielevy Yes. There is a deep suspicion on ...
50118,@LokayFOX5 @LarryHogan how am I supposed to ge...
50119,RT @Sky_Lee_1: @mkraju Sit the f down on this ...
50120,"RT @LoudmouthLira: My friend, puffing on her 3..."


### Convert the DataFrame column into a list

In [37]:
tweets_list = statement_df['statement'].tolist()
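
The preview above appears to repeat the same retweet at rows 0, 2, and 4, so verbatim repeats are still in the list and each one counts separately in any label percentages. If repeats should count only once (an assumption about the goal of the analysis, not something this notebook states), a minimal order-preserving sketch; `unique_in_order` is a hypothetical helper name:

```python
def unique_in_order(items):
    """Drop exact repeats while keeping first-occurrence order."""
    # dict keys are unique and remember insertion order (Python 3.7+)
    return list(dict.fromkeys(items))

# Toy stand-in for tweets_list
sample = ["RT a", "b", "RT a", "c", "RT a"]
print(unique_in_order(sample))  # ['RT a', 'b', 'c']
```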

### Pass each tweet to the pretrained model

In [38]:
result = []
for tweet in tweets_list:
  # Each call returns a one-element list holding a {'label', 'score'} dict
  result.append(classifier(tweet))
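
The loop above calls the classifier once per tweet. A pipeline also accepts a list of texts and returns one dictionary per text, so the work can be handed over in chunks. A minimal sketch under that assumption; `classify_in_batches` and `fake_classify` are hypothetical names, and the chunk size of 32 is arbitrary:

```python
def classify_in_batches(classify, texts, batch_size=32):
    """Call `classify` on successive slices of `texts`; return one flat list."""
    predictions = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        # A pipeline called with a list returns one dict per text
        predictions.extend(classify(chunk))
    return predictions

# Stand-in classifier so the sketch runs without downloading a model
def fake_classify(chunk):
    return [{'label': 'NEGATIVE', 'score': 1.0} for _ in chunk]

flat = classify_in_batches(fake_classify, ["a", "b", "c"], batch_size=2)
print(len(flat))  # 3
```

With the real pipeline this would be `classify_in_batches(classifier, tweets_list)`. Note that the return value is a flat list of dictionaries rather than a list of one-element lists, so any later unpacking needs a single loop instead of a nested one.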
  

In [39]:
label=[]
score=[]
for d in result:
  for e in d:
    label.append(e['label'])
    score.append(e['score'])
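
The nested loop above walks a list of one-element lists. The same unpacking can be written as comprehensions; a minimal sketch on toy data shaped like `result` (the sample values are made up for illustration):

```python
# Toy stand-in with the same nesting as `result`
result_sample = [
    [{'label': 'NEGATIVE', 'score': 0.99}],
    [{'label': 'POSITIVE', 'score': 0.87}],
]

# Flatten the inner lists, then pull out each field
rows = [pred for preds in result_sample for pred in preds]
labels = [row['label'] for row in rows]
scores = [row['score'] for row in rows]   # values stay floats

print(labels)  # ['NEGATIVE', 'POSITIVE']
print(scores)  # [0.99, 0.87]
```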
 

In [40]:
# Build the frame from a dict so that 'score' keeps its float dtype
df = pd.DataFrame({'label': label, 'score': score})
df

Unnamed: 0,label,score
0,NEGATIVE,0.9951887130737305
1,NEGATIVE,0.9962158203125
2,NEGATIVE,0.9951887130737305
3,NEGATIVE,0.9471197128295898
4,NEGATIVE,0.9951887130737305
...,...,...
50117,NEGATIVE,0.9865070581436157
50118,NEGATIVE,0.9991227984428406
50119,NEGATIVE,0.7913647294044495
50120,NEGATIVE,0.9770839810371399


### Validate the label percentages

In [41]:
df['label'].value_counts(normalize=True) * 100

NEGATIVE    86.556801
POSITIVE    13.443199
Name: label, dtype: float64
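
This checkpoint only emits POSITIVE or NEGATIVE, so neutral or ambiguous tweets are forced into one of the two classes. One way to see how much of the split rests on low-confidence predictions is to set those aside before counting. A minimal sketch; the 0.9 cutoff is an arbitrary assumption and `share_by_label` is a hypothetical helper name:

```python
from collections import Counter

def share_by_label(labels, scores, min_score=0.9):
    """Percentage per label, grouping predictions under `min_score` as 'UNCERTAIN'."""
    bucketed = [
        lab if sc >= min_score else 'UNCERTAIN'
        for lab, sc in zip(labels, scores)
    ]
    if not bucketed:
        return {}
    counts = Counter(bucketed)
    total = len(bucketed)
    return {lab: 100 * n / total for lab, n in counts.items()}

# Toy data, not results from the tweet set
print(share_by_label(
    ['NEGATIVE', 'NEGATIVE', 'POSITIVE', 'NEGATIVE'],
    [0.99, 0.79, 0.95, 0.97],
))
# {'NEGATIVE': 50.0, 'UNCERTAIN': 25.0, 'POSITIVE': 25.0}
```

On the notebook's data this would be called as `share_by_label(label, score)`, using the two lists built earlier.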