## 2. Analyze State Operator Tweets for Natural Tweet Selection
For our analysis we will use the HuggingFace Transformers library. HuggingFace hosts a large number of models and associated tokenizers for various tasks, and we will use the default model for Named-Entity Recognition (NER). This will allow us to extract the places, organizations, and people mentioned in the state operator tweets we processed in the previous notebook.

The resulting list of named entities will allow us to find natural tweets on similar topics that we can use as examples for our model. We don't want to train the model on a completely random selection of natural tweets, because their topic distribution is likely to differ significantly from that of state operator tweets. A model trained on such a sample would likely learn to separate tweets by topic rather than by indicators specific to state operators.

Discriminating between natural and state operator tweets is most useful when the tweets concern news, politics, or culture tied to topics of national interest for state operators, so we will bias our selection of natural tweets toward those topics in the next notebook. Some differences in topic and named-entity distribution will likely remain, since matching those distributions exactly is difficult in practice.
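
As a rough illustration of topic-matched selection, the sketch below keeps only candidate tweets that mention at least one entity from a list. The helper `mentions_any_entity`, the entity list, and the candidate tweets are all hypothetical; the next notebook may select tweets differently.

```python
import re

def mentions_any_entity(tweet, entity_words):
    """Return True if the tweet contains any entity as a whole word or phrase."""
    text = tweet.lower()
    for word in entity_words:
        # \b keeps "Russia" from matching inside "Prussian"
        pattern = r"\b" + re.escape(word.lower()) + r"\b"
        if re.search(pattern, text):
            return True
    return False

# Hypothetical inputs: a few entity strings and candidate natural tweets
top_entities = ["Russia", "United Nations"]
candidate_tweets = [
    "The United Nations meets again on Monday.",
    "Just had the best coffee of my life.",
    "New sanctions on Russia announced today.",
]

selected = [t for t in candidate_tweets if mentions_any_entity(t, top_entities)]
print(selected)
```

Matching on whole words is a deliberate choice here: plain substring matching would pull in unrelated tweets whenever an entity name happens to appear inside a longer word.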

### 2.1 Setup

In [None]:
#import os
import pandas as pd
import numpy as np
#import torch

from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
from datasets import load_dataset

We will now create the pipeline for NER and test an example tweet to see the output.

In [None]:
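# grouped_entities=True merges word-piece tokens into whole entities
# (newer transformers releases spell this aggregation_strategy="simple").
# device=0 runs the model on the first GPU; use device=-1 to run on CPU.
nlp = pipeline('ner', grouped_entities=True, device=0)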
example = 'Russia presents bid for EXPO 2025 to Association of Caribbean States. Delegations from over 20 states and unions acknowledged benefits of holding the EXPO in Ekaterinburg. https://t.co/SeOJtjDqOQ'

ner_results = nlp(example)
print(ner_results)
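
With entity grouping enabled, the pipeline returns one dictionary per entity with the keys `entity_group`, `score`, `word`, `start`, and `end`. The sketch below uses hand-written records in that shape (illustrative values, not real model output) to show how low-confidence entities could be dropped. The helper `keep_confident` and the 0.9 threshold are assumptions, not part of the original analysis.

```python
# Hand-written records in the shape the grouped NER pipeline returns.
# Labels and scores are made up for illustration; they are not model output.
sample_results = [
    {"entity_group": "LOC", "score": 0.99, "word": "Russia", "start": 0, "end": 6},
    {"entity_group": "MISC", "score": 0.55, "word": "EXPO", "start": 24, "end": 28},
    {"entity_group": "LOC", "score": 0.98, "word": "Ekaterinburg", "start": 158, "end": 170},
]

def keep_confident(entities, min_score=0.9):
    """Keep only the entities whose score is at least min_score."""
    return [e for e in entities if e["score"] >= min_score]

confident = keep_confident(sample_results)
for e in confident:
    print(e["entity_group"], e["word"])
```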

Below we also define a function to assist in adding the NER data to a dataframe.

In [None]:
def put_ners_into_df(entities):
    """Flattens the per-tweet entity lists returned by the NER pipeline into a dataframe
    
    Args:
        entities (list of list of dicts): One list of entity dictionaries per input tweet
    
    Returns:
        Pandas dataframe: Single easy-to-read dataframe with one row per entity
    """
    all_entities = []
    for t in entities:
        for e in t:
            all_entities.append(e)
    return pd.DataFrame.from_records(all_entities)
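
Note that the function above expects a batch result, meaning one list of entities per tweet. Calling the pipeline on a single string returns a flat list of dictionaries instead, which the nested loop would not handle. The sketch below shows one way to normalize both shapes before flattening; `as_batch` is a hypothetical helper and the records are hand-written.

```python
from itertools import chain

def as_batch(entities):
    """Wrap a single-text result (list of dicts) so it looks like a batch (list of lists)."""
    if entities and isinstance(entities[0], dict):
        return [entities]
    return entities

# Hand-written records standing in for pipeline output
single = [{"entity_group": "LOC", "word": "Russia"}]
batch = [
    [{"entity_group": "LOC", "word": "Russia"}],
    [],  # a tweet with no entities
    [{"entity_group": "ORG", "word": "UN"}],
]

# Flatten either shape into one list of entity records
flat_single = list(chain.from_iterable(as_batch(single)))
flat_batch = list(chain.from_iterable(as_batch(batch)))
print(len(flat_single), len(flat_batch))
```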

### 2.2 Russian Tweet Named Entities

#### 2.2.1 Create Dataset

In [None]:
#rs = pd.read_csv("../working_files/russian_tweet_sequences.csv",lineterminator="\n")
#tweets = rs.clean_tweets.to_list()

# load the data with the datasets library, which returns a DatasetDict
rs_dataset = load_dataset("csv", data_files="../working_files/russian_tweet_sequences.csv",lineterminator='\n')

Below we see the structure of the `DatasetDict` object that the `datasets` library created. `train` is the default split.

In [None]:
rs_dataset

#### 2.2.2 Run Model on Data

In [None]:
tweet_entities = nlp(rs_dataset['train']['clean_tweets'])

In [None]:
entity_df = put_ners_into_df(tweet_entities)
# count how many times each (entity_group, word) pair occurs
results = pd.pivot_table(entity_df, values=['start'], index=['entity_group','word'], aggfunc='count')
# most frequent entities first
results = results.sort_values(by=['start'],ascending=[False])
results = results.rename(columns={'start':'count'})
results.head()
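
The pivot table above amounts to counting how often each `(entity_group, word)` pair occurs. As a cross-check on small inputs, the same tally can be sketched with the standard library alone. The records below are hand-written for illustration and are not taken from the tweet data.

```python
from collections import Counter

# Hand-written batch: one list of entity records per tweet
batch = [
    [{"entity_group": "LOC", "word": "Russia"}, {"entity_group": "ORG", "word": "UN"}],
    [{"entity_group": "LOC", "word": "Russia"}],
    [],  # a tweet with no entities
]

# Count each (entity_group, word) pair across all tweets
counts = Counter((e["entity_group"], e["word"]) for tweet in batch for e in tweet)

# Most frequent pairs first, similar to sorting the pivot table by count
top = counts.most_common(2)
print(top)
```

Because the key includes the entity group, the same word tagged as both `LOC` and `ORG` is counted separately, which matches the two-level index used in the pivot table.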

The entities shown above appear to be about what we'd expect. Let's save the file below.

In [None]:
results.to_csv('../working_files/russian_entities.csv',sep=',', quotechar='"',header=True)

### 2.3 Chinese Tweet Named Entities

#### 2.3.1 Create Dataset

In [None]:
cn_dataset = load_dataset("csv", data_files="../working_files/chinese_tweet_sequences.csv",lineterminator='\n')

#### 2.3.2 Run Model on Data

In [None]:
tweet_entities = nlp(cn_dataset['train']['clean_tweets'])

In [None]:
entity_df = put_ners_into_df(tweet_entities)
# count how many times each (entity_group, word) pair occurs
results = pd.pivot_table(entity_df, values=['start'], index=['entity_group','word'], aggfunc='count')
# most frequent entities first
results = results.sort_values(by=['start'],ascending=[False])
results = results.rename(columns={'start':'count'})
results.head()

In [None]:
results.to_csv('../working_files/chinese_entities.csv',sep=',', quotechar='"',header=True)