<font color='grey'>
  
### Hugging Face Transformers
   
</font>

<font color='grey'>
    
>Let’s have a quick look at the 🤗 Transformers library features. The library downloads pretrained models for Natural Language Understanding (NLU) tasks, such as analyzing the sentiment of a text, and Natural Language Generation (NLG) tasks, such as completing a prompt with new text or translating into another language.

>First we will see how to use the pipeline API to run those pretrained models at inference time. Then we will dig a little deeper and see how the library gives you access to those models and helps you preprocess your data.

#### Getting started on a task with a pipeline

>The easiest way to use a pretrained model on a given task is to use <u>pipeline()</u>. 🤗 Transformers provides the following tasks out of the box:

>* Sentiment analysis: is a text positive or negative?
>* Text generation (in English): provide a prompt and the model will generate what follows.
>* Named entity recognition (NER): in an input sentence, label each word with the entity it represents (person, place, etc.).
>* Question answering: provide the model with some context and a question, and extract the answer from the context.
>* Filling masked text: given a text with masked words (e.g., replaced by [MASK]), fill in the blanks.
>* Summarization: generate a summary of a long text.
>* Translation: translate a text into another language.
>* Feature extraction: return a tensor representation of the text.
    
</font>
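
Each task in the list above is selected by passing a string identifier to `pipeline()`. The mapping below is a sketch of the identifiers commonly used for these tasks. The names `TASK_IDS` and `build_pipeline` are illustrative and not part of the library, and the translation identifier shown is only one possible language pair.

```python
# Illustrative mapping from the tasks listed above to pipeline() identifiers.
TASK_IDS = {
    "Sentiment analysis": "sentiment-analysis",
    "Text generation": "text-generation",
    "Named entity recognition": "ner",
    "Question answering": "question-answering",
    "Filling masked text": "fill-mask",
    "Summarization": "summarization",
    "Translation": "translation_en_to_fr",  # one example language pair
    "Feature extraction": "feature-extraction",
}


def build_pipeline(task_name):
    """Create a pipeline for one of the tasks above (hypothetical helper).

    The first call for a task downloads and caches a default model.
    """
    # Imported lazily so the mapping can be inspected without the library installed.
    from transformers import pipeline

    return pipeline(TASK_IDS[task_name])


# Example, not executed here because it downloads a model:
# classifier = build_pipeline("Sentiment analysis")
```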

In [1]:
pip install transformers

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/f9/54/5ca07ec9569d2f232f3166de5457b63943882f7950ddfcc887732fc7fb23/transformers-4.3.3-py3-none-any.whl (1.9MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp37-none-any.whl size=893262 sha256=4f5c519fc9

In [1]:
from transformers import pipeline
classifier = pipeline('sentiment-analysis')

<font color='grey'>

> The first time this command runs, a pretrained model and its tokenizer are downloaded and cached. We will look at both later on, but as an introduction: the tokenizer’s job is to preprocess the text for the model, which is then responsible for making predictions. The pipeline groups all of that together and post-processes the predictions to make them readable. For instance:
    
</font>

In [2]:
classifier('i dont kmow how i feel')

[{'label': 'NEGATIVE', 'score': 0.9682340621948242}]
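
The post-processing step mentioned above can be sketched in plain Python: the model emits one raw score (logit) per class, a softmax turns the logits into probabilities, and the most probable class is reported with its label. The logits below are made up for illustration, and `postprocess` is a hypothetical helper rather than a library function. The label mapping is assumed to match the default sentiment model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(logits, id2label):
    """Pick the most probable class and report it in a readable form."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Assumed label mapping for a two-class sentiment model.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

# Hypothetical logits, not taken from a real model run.
prediction = postprocess([2.0, -1.5], id2label)
print(prediction)
```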

<font color='grey'>

>By default, the model downloaded for this pipeline is called “distilbert-base-uncased-finetuned-sst-2-english”.  It uses the DistilBERT architecture and has been fine-tuned on a dataset called SST-2 for the sentiment analysis task.

#### Downloaded Pretrained Model Card:

>**DistilBERT base uncased finetuned SST-2**

>This model is a fine-tuned checkpoint of DistilBERT-base-uncased, fine-tuned on SST-2. It reaches an accuracy of 91.3 on the dev set (for comparison, the BERT bert-base-uncased version reaches an accuracy of 92.7).

>**Fine-tuning hyper-parameters**
>* learning_rate = 1e-5
>* batch_size = 32
>* warmup = 600
>* max_seq_length = 128
>* num_train_epochs = 3.0
</font>
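
To make the `learning_rate` and `warmup` values more concrete, here is a minimal sketch of a warmup schedule. It assumes that `warmup = 600` means 600 optimizer steps of linear warmup, which the model card does not state explicitly, and it ignores any decay after warmup. The function name `warmup_lr` is hypothetical.

```python
def warmup_lr(step, base_lr=1e-5, warmup_steps=600):
    """Learning rate at a given step under linear warmup (assumed schedule).

    The rate grows linearly from 0 to base_lr over warmup_steps,
    then stays at base_lr. Decay after warmup is not modelled here.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr


# Inspect the rate at a few points of the assumed schedule.
for step in (0, 300, 600, 5000):
    print(step, warmup_lr(step))
```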

<font color='grey'>
    
### Load the target data from GitHub

</font>

In [5]:
#url='https://raw.githubusercontent.com/drainganggtb/FinalProject/main/clean_tweets.csv'
url='https://raw.githubusercontent.com/drainganggtb/FinalProject/main/new_df.csv'

In [10]:
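# Load the second CSV column (the tweet text) into a DataFrame
import pandas as pd
statement_df = pd.read_csv(
    url, encoding='UTF-8', header=None, sep=',', usecols=[1], lineterminator='\n',
    on_bad_lines='skip',  # pandas >= 1.3; older versions use error_bad_lines=False
).drop(index=0)  # header=None keeps the original header text in row 0, so drop it
statement_df.reset_index(drop=True, inplace=True)
statement_df.columns = ['statement']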
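
The loading step reads the file without a header, keeps only the second column, and then drops the row that held the original header text. The sketch below reproduces that on a tiny in-memory CSV and compares it with a shorter variant that lets pandas consume the header directly. The sample text and the column name `idx` are made up for illustration, and the real file is assumed to have the same two-column layout.

```python
import io

import pandas as pd

# Hypothetical sample with the assumed layout: an index column and a text column.
csv_text = ",statement\n0,first tweet\n1,second tweet\n"

# Same approach as the loading cell: no header, keep column 1, drop the header row.
manual = pd.read_csv(io.StringIO(csv_text), header=None, usecols=[1])
manual = manual.drop(index=0).reset_index(drop=True)
manual.columns = ["statement"]

# Alternative: treat row 0 as the header and supply the column names up front.
direct = pd.read_csv(
    io.StringIO(csv_text), header=0, names=["idx", "statement"], usecols=[1]
)

print(manual["statement"].tolist())
print(direct["statement"].tolist())
```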

In [11]:
statement_df

| | statement |
|---|---|
| 0 | Machine learning on distributed Dask using Ama... |
| 1 | RT @realmleviticus: Top 4 python programming l... |
| 2 | RT @Eli_Krumova: Classification with Localizat... |
| 3 | RT @axelrod_eric: Artificial intelligence and ... |
| 4 | RT @Omkar_Raii: #AgriTech startups like @AgNex... |
| ... | ... |
| 20026 | RT @intuitibits: Airtool 2.2 is out. Adds supp... |
| 20027 | #XSLLabs #XSL #SYL #Sylare |
| 20028 | RT @CZDS: PGA Tour signs up with AWS, looks to... |
| 20029 | RT @Strat_AI: Build a #Chatbot with #DeepLearn... |


#### Drop missing values, if any

In [27]:
statement_df = statement_df.dropna()
statement_df

| | statement |
|---|---|
| 0 | Machine learning on distributed Dask using Ama... |
| 1 | RT @realmleviticus: Top 4 python programming l... |
| 2 | RT @Eli_Krumova: Classification with Localizat... |
| 3 | RT @axelrod_eric: Artificial intelligence and ... |
| 4 | RT @Omkar_Raii: #AgriTech startups like @AgNex... |
| ... | ... |
| 20026 | RT @intuitibits: Airtool 2.2 is out. Adds supp... |
| 20027 | #XSLLabs #XSL #SYL #Sylare |
| 20028 | RT @CZDS: PGA Tour signs up with AWS, looks to... |
| 20029 | RT @Strat_AI: Build a #Chatbot with #DeepLearn... |
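
`dropna()` removes rows with missing values but leaves repeated rows in place. Since retweets often repeat the same text, exact duplicates may also be worth removing. The sketch below shows both steps on a made-up frame; whether deduplication is appropriate for this dataset is a judgment call, not something the notebook does.

```python
import pandas as pd

# Toy data, invented for illustration: one repeated text and one missing value.
toy = pd.DataFrame({"statement": ["RT a", "RT a", None, "b"]})

# Step 1: drop rows whose text is missing.
no_missing = toy.dropna()

# Step 2: drop rows whose text repeats an earlier row, then renumber the index.
deduped = no_missing.drop_duplicates(subset="statement").reset_index(drop=True)

print(len(toy), len(no_missing), len(deduped))
```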


### Convert the DataFrame column into a list

In [28]:
tweets_list = statement_df['statement'].tolist()

### Pass the list of tweets to the pretrained model

In [14]:
# Classify each tweet; every call returns a one-element list such as [{'label': ..., 'score': ...}]
result = []
for tweet in tweets_list:
  result.append(classifier(tweet))
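
A sentiment pipeline also accepts a list of strings and returns one dictionary per input, so the tweets can be sent in chunks rather than one call per tweet. The sketch below shows the chunking logic only. `classify_in_batches` is a hypothetical helper, and `fake_classifier` is a stand-in so the example runs without downloading a model; in the notebook the real `classifier` would be passed instead.

```python
def classify_in_batches(texts, classify, batch_size=32):
    """Call `classify` on consecutive slices of `texts` and collect the results."""
    results = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        results.extend(classify(batch))  # one dict per input text
    return results


def fake_classifier(batch):
    """Stand-in for the real pipeline: NOT a model, just a keyword rule."""
    return [
        {"label": "POSITIVE" if "good" in text else "NEGATIVE", "score": 1.0}
        for text in batch
    ]


predictions = classify_in_batches(
    ["good day", "bad day", "so good"], fake_classifier, batch_size=2
)

# The flat result makes the label and score lists one comprehension each.
labels = [p["label"] for p in predictions]
scores = [p["score"] for p in predictions]
print(labels)
```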
  

In [15]:
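# Flatten the nested results into parallel lists of labels and scores
label = []
score = []
for predictions in result:
  for prediction in predictions:
    label.append(prediction['label'])
    score.append(prediction['score'])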
 

In [18]:
import numpy as np
# Note: np.column_stack builds a single string array, so the 'score' column holds text, not floats
df = pd.DataFrame(np.column_stack([label, score]),columns=['label','score'])
df

| | label | score |
|---|---|---|
| 0 | NEGATIVE | 0.9953653812408447 |
| 1 | POSITIVE | 0.9941784739494324 |
| 2 | NEGATIVE | 0.996444046497345 |
| 3 | NEGATIVE | 0.9980283975601196 |
| 4 | NEGATIVE | 0.9861236214637756 |
| ... | ... | ... |
| 20026 | NEGATIVE | 0.9911964535713196 |
| 20027 | NEGATIVE | 0.9798569679260254 |
| 20028 | NEGATIVE | 0.9968160390853882 |
| 20029 | NEGATIVE | 0.995816707611084 |
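
Stacking a list of strings with a list of floats produces one string-typed array, so numeric operations on the scores would need a conversion first. A possible alternative, sketched here with the first two rows of the output above, is to build the frame from a dictionary so each column keeps its own type. The name `scores_df` and the 0.995 threshold are illustrative.

```python
import pandas as pd

# First two predictions from the output above.
label = ["NEGATIVE", "POSITIVE"]
score = [0.9953653812408447, 0.9941784739494324]

# Building from a dict keeps 'score' as a float column.
scores_df = pd.DataFrame({"label": label, "score": score})
print(scores_df.dtypes)

# Numeric comparisons now work directly, for example an arbitrary confidence cut-off.
high_confidence = scores_df[scores_df["score"] > 0.995]
print(high_confidence)
```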


### Check the percentage of each label

In [26]:
df['label'].value_counts(normalize=True) * 100

NEGATIVE    74.953822
POSITIVE    25.046178
Name: label, dtype: float64
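
`value_counts(normalize=True)` divides each label's count by the total number of rows, and multiplying by 100 gives percentages. The same arithmetic in plain Python, on a small made-up list of labels:

```python
from collections import Counter

# Made-up labels for illustration: three negative, one positive.
labels = ["NEGATIVE", "NEGATIVE", "NEGATIVE", "POSITIVE"]

counts = Counter(labels)
total = len(labels)

# Share of each label as a percentage of all rows.
percentages = {name: 100 * count / total for name, count in counts.items()}
print(percentages)
```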