In [15]:
import pandas as pd
import nltk

### Punkt sentence tokenizer
- Use the Punkt tokenizer model, which segments text more reliably than plain string splitting.
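
To see why plain string splitting falls short, here is a minimal sketch on a made-up sentence (the text is an illustrative assumption, not taken from the datasets). It only shows the failure of the naive approach; the model-based call is left as a comment because it needs the downloaded Punkt data.

```python
# Naive sentence splitting on ". " (plain string method, no model).
# The sample text is hypothetical and chosen to contain abbreviations.
text = "Dr. Smith arrived at 9 a.m. on Monday. He gave a talk."

naive_sentences = text.split(". ")
print(naive_sentences)
print(len(naive_sentences), "pieces for a text with 2 sentences")

# With the Punkt data available, the model-based alternative is:
#   import nltk
#   nltk.sent_tokenize(text)
```

The abbreviations "Dr." and "a.m." each trigger a false split, which is the kind of case a trained sentence tokenizer is meant to handle.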

In [1]:
import nltk  # imported in this cell so it runs on its own, without a NameError
nltk.download('punkt')

## One-hot encoding
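- It is also known as dummy-variable encoding.
- It is commonly used to convert categorical or textual data into numeric data.
- The obvious benefit is that it expands the number of features; each added feature gives the model one more basis for prediction.
- The resulting features are separate and independent of one another. This is not necessarily an advantage, because interactions between them cannot be captured.
- That independence does help feature selection and makes it easier to find the key features with the most influence on the target.<br>
This property pairs well with regression analysis.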
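
As a minimal sketch of what the encoding produces, the following builds one-hot vectors by hand for a toy categorical column (the values are hypothetical). It is meant only to make the idea concrete, not to replace a library implementation.

```python
# Toy categorical column (hypothetical values, not from the notebook's data).
colors = ["red", "green", "red", "blue"]

# One output column per distinct category, in sorted order.
categories = sorted(set(colors))

# Each row becomes a 0/1 vector containing exactly one 1.
one_hot = [[int(value == cat) for cat in categories] for value in colors]

for value, row in zip(colors, one_hot):
    print(value, row)
```

Each category gets its own column, which is why the feature count grows with the number of distinct values.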

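## Weaknesses of one-hot encoding for textual data
- If each article or paragraph is treated as its own category, there is no similarity left to compare between them.
- If one-hot encoding only records whether each word is present (True/False), writers become hard to tell apart (under-fitting).
- Word counts are comparatively more discriminative.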
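
The difference between presence flags and word counts can be sketched with two made-up documents that share a vocabulary but use the words with different frequencies. The documents and helper names are assumptions for illustration.

```python
from collections import Counter

# Two hypothetical documents with the same vocabulary, different emphasis.
doc_a = "data data data model"
doc_b = "data model model model"

vocab = sorted(set(doc_a.split()) | set(doc_b.split()))

def presence_vector(doc):
    """1 if the word occurs in the document at all, else 0."""
    words = set(doc.split())
    return [int(w in words) for w in vocab]

def count_vector(doc):
    """Number of times each vocabulary word occurs in the document."""
    counts = Counter(doc.split())
    return [counts[w] for w in vocab]

print(presence_vector(doc_a), presence_vector(doc_b))  # identical vectors
print(count_vector(doc_a), count_vector(doc_b))        # different vectors
```

With presence flags the two documents are indistinguishable, while the counts separate them.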

## Representation of Word count
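- Word counts share most of the problems of dummy variables: although they are a better level of abstraction than one-hot encoding, they still suffer from the curse of dimensionality.
- Another problem is the sparse matrix: most entries are 0, which makes training harder.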
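
Both problems show up even on a tiny example. The sketch below builds a count matrix for three made-up documents and measures how many of its entries are zero; the documents are hypothetical.

```python
from collections import Counter

# Three short hypothetical documents with no words in common.
docs = ["the cat sat", "a dog barked loudly", "birds sing"]

# The matrix has one column per distinct word in the whole corpus.
vocab = sorted({word for doc in docs for word in doc.split()})

matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[word] for word in vocab])

zeros = sum(value == 0 for row in matrix for value in row)
total = len(docs) * len(vocab)
sparsity = zeros / total

print(f"{len(vocab)} columns, {zeros}/{total} entries are zero ({sparsity:.0%})")
```

The column count equals the vocabulary size, so it keeps growing as new words appear, while each row only fills the few columns for its own words.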

In [12]:
russian_tweets_df = pd.read_csv("russian_tweets.csv")
ted_df = pd.read_csv("ted.csv")

In [9]:
russian_tweets_df.head()

Unnamed: 0.1,Unnamed: 0,content
0,127447,LIVE STREAM VIDEO=> Donald Trump Rallies in Co...
1,123642,Muslim Attacks NYPD Cops with Meat Cleaver. Me...
2,226970,.@vfpatlas well that's a swella word there (di...
3,138339,RT wehking_pamela: Bobby_Axelrod2k MMFlint don...
4,161610,Жители обстреливаемых районов Донецка проводят...


In [13]:
ted_df.head()

Unnamed: 0,transcript,url
0,"We're going to talk — my — a new lecture, just...",https://www.ted.com/talks/al_seckel_says_our_b...
1,"This is a representation of your brain, and yo...",https://www.ted.com/talks/aaron_o_connell_maki...
2,It's a great honor today to share with you The...,https://www.ted.com/talks/carter_emmart_demos_...
3,"My passions are music, technology and making t...",https://www.ted.com/talks/jared_ficklin_new_wa...
4,It used to be that if you wanted to get a comp...,https://www.ted.com/talks/jeremy_howard_the_wo...


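### The pandas apply method
apply performs a transformation: it takes a function and applies it to every value in the column.
- Here it computes the number of characters in content.
- The character count is itself a numeric representation of textual data.<br>
(The length of the content does carry some meaningful insight.)
- That said, a character is a small unit, so the count may carry less meaning. It can still serve as a reference for sentiment analysis, because whether a text is long or short often conveys something beyond the words themselves.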
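
A minimal sketch of the same character count on a toy frame (the rows, including the missing value, are made up for illustration). It uses the .str.len() accessor, which yields NaN for a missing entry, whereas apply(len) would raise a TypeError on that row.

```python
import pandas as pd

# Hypothetical frame with one missing tweet.
df = pd.DataFrame({"content": ["hello world", "RT: hi", None]})

# .str.len() counts characters per row and propagates missing values as NaN.
df["char_count"] = df["content"].str.len()

print(df)
```

If the column is known to contain no missing values, apply(len) and .str.len() give the same counts.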

### The mean() method of a pandas DataFrame column
The mean gives the overall center of the column and can serve as a benchmark. The concept comes from statistics.
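
One caveat when using the mean as a benchmark is that a few very long texts can pull it upward. The sketch below, on made-up lengths, compares it with the median as a companion measure; the numbers are assumptions for illustration.

```python
from statistics import mean, median

# Hypothetical character counts: four short texts and one long one.
lengths = [20, 25, 30, 35, 140]

print("mean:  ", mean(lengths))    # pulled upward by the single long text
print("median:", median(lengths))  # the middle value of the sorted lengths
```

On a pandas column the equivalent calls are .mean() and .median().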

In [10]:
russian_tweets_df['char_count'] = russian_tweets_df['content'].apply(len)
russian_tweets_df.head()

Unnamed: 0.1,Unnamed: 0,content,char_count
0,127447,LIVE STREAM VIDEO=> Donald Trump Rallies in Co...,130
1,123642,Muslim Attacks NYPD Cops with Meat Cleaver. Me...,138
2,226970,.@vfpatlas well that's a swella word there (di...,65
3,138339,RT wehking_pamela: Bobby_Axelrod2k MMFlint don...,139
4,161610,Жители обстреливаемых районов Донецка проводят...,131


In [11]:
russian_tweets_content_mean = russian_tweets_df['char_count'].mean()
russian_tweets_content_mean

103.462

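## Tokenization
Tokenization here does not rely on string splitting, because that rule is too simple: it cuts only on whitespace, so many tokens (for example a word and its trailing punctuation) stay stuck together.
<br>(The splitting approach means writing a token function and using apply to process each row.)
- Tokenization breaks text into smaller chunks (tokens).
- This notebook uses nltk's word_tokenize. I have heard that nltk is not an industrial-grade tool, so use it with care on large volumes of data.<br>
(In practice nltk does run rather slowly.)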
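
To illustrate what whitespace splitting leaves uncut, the sketch below compares it with a simple regular-expression tokenizer on a made-up sentence. The regex is only a stand-in chosen for this example; it is not the rule set that nltk's word_tokenize uses and will not produce identical tokens.

```python
import re

# Hypothetical sample sentence.
text = "We're going to talk, just for a moment."

# Whitespace splitting keeps punctuation attached to the neighbouring word.
whitespace_tokens = text.split()

# Simple stand-in rule: runs of word characters, or single punctuation marks.
regex_tokens = re.findall(r"\w+|[^\w\s]", text)

print(whitespace_tokens)
print(regex_tokens)
```

In the whitespace version "talk," and "moment." remain single tokens, while the regex version separates the punctuation.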

In [20]:
def tokenization(content_str):
    token_list = nltk.word_tokenize(content_str)
    return token_list
def token_count(word_list):
    return len(word_list)

In [22]:
ted_df['sent_tokens'] = ted_df['transcript'].apply(tokenization)
ted_df['token_count'] = ted_df['sent_tokens'].apply(token_count)
ted_df.head()

Unnamed: 0,transcript,url,word_count,sent_tokens,token_count
0,"We're going to talk — my — a new lecture, just...",https://www.ted.com/talks/al_seckel_says_our_b...,"[We, 're, going, to, talk, —, my, —, a, new, l...","[We, 're, going, to, talk, —, my, —, a, new, l...",2003
1,"This is a representation of your brain, and yo...",https://www.ted.com/talks/aaron_o_connell_maki...,"[This, is, a, representation, of, your, brain,...","[This, is, a, representation, of, your, brain,...",1595
2,It's a great honor today to share with you The...,https://www.ted.com/talks/carter_emmart_demos_...,"[It, 's, a, great, honor, today, to, share, wi...","[It, 's, a, great, honor, today, to, share, wi...",1001
3,"My passions are music, technology and making t...",https://www.ted.com/talks/jared_ficklin_new_wa...,"[My, passions, are, music, ,, technology, and,...","[My, passions, are, music, ,, technology, and,...",1826
4,It used to be that if you wanted to get a comp...,https://www.ted.com/talks/jeremy_howard_the_wo...,"[It, used, to, be, that, if, you, wanted, to, ...","[It, used, to, be, that, if, you, wanted, to, ...",3974


### The mean of # of words
- I think the word count is more representative and less noisy than the character count.
- A length measured purely in characters is, in my view, largely noise, so that representation is not very representative.

In [24]:
ted_df['token_count'].mean()

2344.85

### Hashtag detection by str startswith()
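- Social-network posts often begin with a hashtag, and str.startswith() can detect this.
- The result is then one-hot encoded as a category, which gives the analysis a basis for separating posts.<br>
(This can be combined with stratification: group the sentences by hashtag type and analyze each group independently, which may be more representative.)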
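
Checking only the first token misses hashtags that appear later in a post. As a minimal sketch on made-up tweets, the helpers below (hypothetical names) detect a leading hashtag and also collect every hashtag in the text.

```python
# Hypothetical tweets for illustration.
tweets = [
    "#news Big story today",
    "Watch this now #breaking #live",
    "Nothing tagged here",
]

def starts_with_hashtag(text):
    """True if the first whitespace-separated token begins with '#'."""
    tokens = text.split()
    return bool(tokens) and tokens[0].startswith("#")

def extract_hashtags(text):
    """Return every whitespace-separated token that begins with '#'."""
    return [tok for tok in text.split() if tok.startswith("#")]

leading = [starts_with_hashtag(t) for t in tweets]
tags = [extract_hashtags(t) for t in tweets]

print(leading)
print(tags)
```

The second tweet is tagged but does not start with a hashtag, so a startswith() check on the first token alone would label it as a normal sentence.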

In [69]:
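def hash_tags(token_list):
    # Guard against an empty token list (e.g. a blank tweet) before indexing
    if token_list and token_list[0].startswith('#'):
        tag = '#'
    else:
        tag = 'normal sent'
    return tag
def word_split(sent):
    # str.split() with no arguments splits on any run of whitespace
    words = sent.split()
    return words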

In [70]:
# No Russian tokenization model is used here, so fall back to whitespace splitting
russian_tweets_df['sent_tokens'] = russian_tweets_df['content'].apply(word_split)
russian_tweets_df.head()

Unnamed: 0.1,Unnamed: 0,content,sent_tokens,tag
0,127447,LIVE STREAM VIDEO=> Donald Trump Rallies in Co...,"[LIVE, STREAM, VIDEO=>, Donald, Trump, Rallies...",normal sent
1,123642,Muslim Attacks NYPD Cops with Meat Cleaver. Me...,"[Muslim, Attacks, NYPD, Cops, with, Meat, Clea...",normal sent
2,226970,.@vfpatlas well that's a swella word there (di...,"[.@vfpatlas, well, that's, a, swella, word, th...",normal sent
3,138339,RT wehking_pamela: Bobby_Axelrod2k MMFlint don...,"[RT, wehking_pamela:, Bobby_Axelrod2k, MMFlint...",normal sent
4,161610,Жители обстреливаемых районов Донецка проводят...,"[Жители, обстреливаемых, районов, Донецка, про...",normal sent


In [71]:
russian_tweets_df['tag'] = russian_tweets_df['sent_tokens'].apply(hash_tags)
russian_tweets_df.head()

Unnamed: 0.1,Unnamed: 0,content,sent_tokens,tag
0,127447,LIVE STREAM VIDEO=> Donald Trump Rallies in Co...,"[LIVE, STREAM, VIDEO=>, Donald, Trump, Rallies...",normal sent
1,123642,Muslim Attacks NYPD Cops with Meat Cleaver. Me...,"[Muslim, Attacks, NYPD, Cops, with, Meat, Clea...",normal sent
2,226970,.@vfpatlas well that's a swella word there (di...,"[.@vfpatlas, well, that's, a, swella, word, th...",normal sent
3,138339,RT wehking_pamela: Bobby_Axelrod2k MMFlint don...,"[RT, wehking_pamela:, Bobby_Axelrod2k, MMFlint...",normal sent
4,161610,Жители обстреливаемых районов Донецка проводят...,"[Жители, обстреливаемых, районов, Донецка, про...",normal sent


### Getting dummy variables
- Use the pd.get_dummies method to one-hot encode the tag column.
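
A self-contained sketch of pd.get_dummies on a toy frame (the rows are made up). Recent pandas versions return boolean columns by default, so dtype=int is passed here to get 0/1 values; whether that is needed depends on the installed version.

```python
import pandas as pd

# Hypothetical frame with a two-valued tag column.
df = pd.DataFrame({
    "content": ["#a hi", "hello", "yo"],
    "tag": ["#", "normal sent", "normal sent"],
})

# get_dummies returns a new frame; the original df is left unchanged.
encoded = pd.get_dummies(df, columns=["tag"], dtype=int)

print(encoded)
```

Because the result is a new frame, it has to be assigned to a variable to be used in later cells.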

In [72]:
pd.get_dummies(russian_tweets_df, columns=['tag'])

Unnamed: 0.1,Unnamed: 0,content,sent_tokens,tag_#,tag_normal sent
0,127447,LIVE STREAM VIDEO=> Donald Trump Rallies in Co...,"[LIVE, STREAM, VIDEO=>, Donald, Trump, Rallies...",0,1
1,123642,Muslim Attacks NYPD Cops with Meat Cleaver. Me...,"[Muslim, Attacks, NYPD, Cops, with, Meat, Clea...",0,1
2,226970,.@vfpatlas well that's a swella word there (di...,"[.@vfpatlas, well, that's, a, swella, word, th...",0,1
3,138339,RT wehking_pamela: Bobby_Axelrod2k MMFlint don...,"[RT, wehking_pamela:, Bobby_Axelrod2k, MMFlint...",0,1
4,161610,Жители обстреливаемых районов Донецка проводят...,"[Жители, обстреливаемых, районов, Донецка, про...",0,1
...,...,...,...,...,...
995,24329,How To Inspire People With Your Music! https:/...,"[How, To, Inspire, People, With, Your, Music!,...",0,1
996,315060,... https://t.co/AfWdTkKQlm,"[..., https://t.co/AfWdTkKQlm]",0,1
997,361901,Trevor Noah: Until we start treating racism li...,"[Trevor, Noah:, Until, we, start, treating, ra...",0,1
998,119948,SenSanders: RT SenJeffMerkley: We must act bol...,"[SenSanders:, RT, SenJeffMerkley:, We, must, a...",0,1
