## 4. Concatenate & Split Dataset
Now that we have a large number of tweet sequences for Russian and Chinese state operators, as well as for typical Twitter users, we'll combine them into a single labeled dataset and create a random train-test split.

In [None]:
import pandas as pd

### 4.1 Load and Join Files

In [None]:
russian = pd.read_csv("../working_files/russian_tweet_sequences.csv", lineterminator='\n', index_col=0)
# Label state-operator sequences as the positive class
russian['operator'] = 1

In [None]:
russian.head()

In [None]:
chinese = pd.read_csv("../working_files/chinese_tweet_sequences.csv", lineterminator='\n',index_col=0)
chinese['operator'] = 1

In [None]:
chinese.head()

In [None]:
real = pd.read_csv("../working_files/real_tweet_sequences.csv", lineterminator='\n', index_col=0)
# Label typical-user sequences as the negative class
real['operator'] = 0

In [None]:
real.head()

In [None]:
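# Stack the three sources; each frame keeps its original index, so row labels may repeat
seqs = pd.concat([russian, chinese, real])
final_cols = ['userid', 'tweet_text', 'tweet_time', 'clean_tweets', 'recent_tweets', 'operator']

# Keep only the columns needed downstream
seqs = seqs[final_cols].copy()
seqs.shape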

In [None]:
# Assign a unique id to every sequence so rows can be tracked across the split
seqs['seq_id'] = range(len(seqs))
seqs.head(10)
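
A note on why a fresh `seq_id` is useful: `pd.concat` keeps each frame's original index, so row labels can repeat across the three sources. The sketch below illustrates this with small toy frames; the names `russian_toy`, `chinese_toy`, and `real_toy` and their values are made up for illustration and are not part of the real data.

```python
import pandas as pd

# Toy stand-ins for the three sequence files (illustrative values only).
russian_toy = pd.DataFrame({"userid": ["r1", "r2"], "operator": 1})
chinese_toy = pd.DataFrame({"userid": ["c1"], "operator": 1})
real_toy = pd.DataFrame({"userid": ["u1", "u2", "u3"], "operator": 0})

# Concatenation preserves each frame's own 0..n-1 index, so labels repeat.
combined = pd.concat([russian_toy, chinese_toy, real_toy])
index_has_duplicates = bool(combined.index.duplicated().any())

# A running counter gives every row a unique identifier regardless of the index.
combined["seq_id"] = range(len(combined))
seq_id_is_unique = combined["seq_id"].is_unique

print(index_has_duplicates, seq_id_is_unique)
```

Passing `ignore_index=True` to `pd.concat` would be another way to get unique row labels, but an explicit `seq_id` column also survives being written to CSV with `index=False`.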

In [None]:
seqs["operator"].value_counts()

### 4.2 Split Sets

#### 4.2.1 Test Set
Here, we'll hold out 10% of the data as a test set, to be used only after we have tuned hyperparameters and fine-tuned the model. Keeping it aside ensures that the final evaluation is not inflated by overfitting through hyperparameter selection.

In [None]:
test_set = seqs.sample(n=int(0.1*len(seqs)), random_state=13, replace=False)
test_set.shape
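
Because the sample is purely random rather than stratified, it may be worth checking that the share of operator and non-operator sequences in the held-out set is close to the share in the full dataset. The following is a minimal sketch on a toy frame; the 30/70 label mix and the name `toy` are assumptions for illustration only.

```python
import pandas as pd

# Toy dataset with a made-up 30/70 label mix (illustrative only).
toy = pd.DataFrame({
    "seq_id": range(1000),
    "operator": [1] * 300 + [0] * 700,
})

# Same sampling call pattern as the notebook: 10% without replacement, fixed seed.
test_toy = toy.sample(n=int(0.1 * len(toy)), random_state=13, replace=False)

# Compare class proportions in the full data and in the held-out sample.
full_share = toy["operator"].value_counts(normalize=True)
test_share = test_toy["operator"].value_counts(normalize=True)
comparison = pd.DataFrame({"full": full_share, "test": test_share})

print(comparison)
```

If the two columns differ noticeably on the real data, a stratified split would be one option to consider.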

In [None]:
test_set.to_csv('../data/test.csv',index=False, sep=',', quotechar='"',header=True)

In [None]:
test_ids = test_set['seq_id'].values

#### 4.2.2 Training Set
The remaining 90% of the data will be used for training. We'll randomly split it into training and validation sets in the next notebook.

In [None]:
training_set = seqs[~seqs['seq_id'].isin(test_ids)]
training_set.shape
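
A quick sanity check on a split like this is to confirm that the two sets share no `seq_id` and together cover every row. The sketch below shows the idea on a toy frame; the names `toy`, `test_toy`, and `train_toy` are hypothetical stand-ins for `seqs`, `test_set`, and `training_set`.

```python
import pandas as pd

# Toy frame standing in for the combined sequences (illustrative only).
toy = pd.DataFrame({"seq_id": range(50), "operator": [1, 0] * 25})

# Hold out 10% for testing, then keep every other row for training.
test_toy = toy.sample(n=int(0.1 * len(toy)), random_state=13, replace=False)
train_toy = toy[~toy["seq_id"].isin(test_toy["seq_id"])]

# The sets should not overlap, and together they should cover all rows.
overlap = set(train_toy["seq_id"]) & set(test_toy["seq_id"])
covers_all_rows = len(train_toy) + len(test_toy) == len(toy)

print(len(overlap), covers_all_rows)
```

Note that this only checks sequence-level disjointness; it does not say whether sequences from the same `userid` end up on both sides of the split.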

In [None]:
training_set.to_csv('../data/train.csv',index=False, sep=',', quotechar='"',header=True)