## Kaggle Start Book in Python 

(PythonではじめるKaggleスタートブック)

Ishihara, Shotaro; Murata, Hideki. Practical Data Science Series: Kaggle Start Book in Python (KS Information Science Specialized Book). Kodansha.

石原祥太郎; 村田秀樹. 実践Data Scienceシリーズ　PythonではじめるKaggleスタートブック (ＫＳ情報科学専門書) . 講談社. 

# Section 3.3 Going Beyond Titanic (3)! Let's Work with Text Data!

(3.3 Titanicの先へ行く③! テキストデータに触れてみよう)

Original notebook:

https://www.kaggle.com/code/sishihara/py310-python-kaggle-start-book-ch03-03

In [1]:
import pandas as pd

In [2]:
df = pd.DataFrame({'text': ['I like kaggle very much',
                            'I do not like kaggle',
                            'I do really love machine learning']})
df

| | text |
|---|---|
| 0 | I like kaggle very much |
| 1 | I do not like kaggle |
| 2 | I do really love machine learning |


# Bag of Words

Bag of Words represents each sentence as a vector that counts how many times each word appears in it.

In [3]:
from sklearn.feature_extraction.text import CountVectorizer


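# The default token_pattern drops one-character words such as "I";
# this pattern keeps every word of one or more characters.
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b')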
bag = vectorizer.fit_transform(df['text'])
bag.toarray()



array([[0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1],
       [1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
       [1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0]])

In [4]:
print(vectorizer.vocabulary_)

{'i': 1, 'like': 4, 'kaggle': 2, 'very': 10, 'much': 7, 'do': 0, 'not': 8, 'really': 9, 'love': 5, 'machine': 6, 'learning': 3}
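To make the counting explicit, here is a minimal sketch that rebuilds the same matrix with only the standard library. It assumes lowercasing and whitespace tokenization, which happens to match `CountVectorizer` on these three sentences. The helper name `bag_of_words` is made up for this illustration.

```python
from collections import Counter


def bag_of_words(texts):
    """Return (vocabulary, count matrix) for a list of sentences.

    Hypothetical helper for illustration: lowercases and splits on whitespace.
    """
    tokenized = [text.lower().split() for text in texts]
    # Sort the unique words so that column indices are assigned alphabetically
    unique_words = sorted({word for doc in tokenized for word in doc})
    vocabulary = {word: index for index, word in enumerate(unique_words)}

    matrix = []
    for doc in tokenized:
        row = [0] * len(vocabulary)
        for word, count in Counter(doc).items():
            row[vocabulary[word]] = count
        matrix.append(row)
    return vocabulary, matrix


texts = ['I like kaggle very much',
         'I do not like kaggle',
         'I do really love machine learning']
vocabulary, matrix = bag_of_words(texts)
for row in matrix:
    print(row)
```

Each row has one column per vocabulary word, so the vector length depends on the vocabulary size, not on the sentence length.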


> Bag of Words is a simple and easy-to-understand method, but it has the following weaknesses:
>
> 1. It does not express word rarity.
> 2. It does not take into account the proximity of words.
> 3. It leaves out information about the order of words in a sentence.

Ishihara, Shotaro; Murata, Hideki. Kaggle Start Book in Python (KS Information Science Specialized Book) (p.203). Kodansha. Kindle edition.
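The third weakness is easy to see with two sentences that use the same words in a different order. The following is a minimal sketch with made-up example sentences, using plain word counts in place of `CountVectorizer`:

```python
from collections import Counter

# Two sentences with opposite meanings but exactly the same words
a = Counter('kaggle is not hard it is fun'.split())
b = Counter('kaggle is not fun it is hard'.split())

# The word counts are identical, so Bag of Words cannot tell them apart
print(a == b)
```

Because only counts survive, any model trained on these features sees the two sentences as the same input.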

# TF-IDF

TF-IDF is a method that takes the rarity of words into account. It multiplies the term frequency (how often a word appears in a document) by the inverse document frequency (a weight based on the reciprocal of the fraction of documents that contain the word, so rarer words count for more).

In [5]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer


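# Count words first, then reweight the counts by TF-IDF
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
transformer = TfidfTransformer()

tf = vectorizer.fit_transform(df['text'])  # raw term counts
tfidf = transformer.fit_transform(tf)  # IDF-weighted, L2-normalized rows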
print(tfidf.toarray())

[[0.         0.31544415 0.40619178 0.         0.40619178 0.
  0.         0.53409337 0.         0.         0.53409337]
 [0.43306685 0.33631504 0.43306685 0.         0.43306685 0.
  0.         0.         0.56943086 0.         0.        ]
 [0.34261996 0.26607496 0.         0.45050407 0.         0.45050407
  0.45050407 0.         0.         0.45050407 0.        ]]


In [6]:
print(vectorizer.vocabulary_)

{'i': 1, 'like': 4, 'kaggle': 2, 'very': 10, 'much': 7, 'do': 0, 'not': 8, 'really': 9, 'love': 5, 'machine': 6, 'learning': 3}
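The values printed above can be reproduced by hand. The sketch below assumes scikit-learn's default settings for `TfidfTransformer` (`smooth_idf=True`, `norm='l2'`), under which the weight is $\mathrm{idf}(t) = \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1$ and each row is then scaled to unit length. The variable names are chosen for this illustration only.

```python
import math

texts = ['I like kaggle very much',
         'I do not like kaggle',
         'I do really love machine learning']
docs = [text.lower().split() for text in texts]
vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Document frequency: number of sentences that contain each word
doc_freq = {word: sum(word in doc for doc in docs) for word in vocab}
# Smoothed IDF (assumed default): ln((1 + n) / (1 + df)) + 1
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}

tfidf_rows = []
for doc in docs:
    # Term frequency times IDF, then L2 normalization of the row
    row = [doc.count(word) * idf[word] for word in vocab]
    norm = math.sqrt(sum(value * value for value in row))
    tfidf_rows.append([value / norm for value in row])

for row in tfidf_rows:
    print([round(value, 8) for value in row])
```

The word "i" appears in all three sentences, so its IDF is the minimum value of 1 and it gets the smallest nonzero weight in every row, while words that appear in only one sentence, such as "very" and "much", get the largest.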


# Word2vec

Word2vec is a vectorization method that captures how close words are in meaning.

In [7]:
from gensim.models import word2vec


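# Tokenize by whitespace (unlike CountVectorizer, this keeps the original case, so 'I' stays uppercase)
sentences = [d.split() for d in df['text']]
# vector_size: embedding dimension; min_count=1 keeps words that occur only once; window: context size
model = word2vec.Word2Vec(sentences, vector_size=10, min_count=1, window=2, seed=7)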

In [8]:
model.wv['like']

array([ 0.01650858,  0.01069946,  0.00188946,  0.09910005,  0.06153275,
        0.05853238,  0.04005488,  0.02443584, -0.03179482,  0.09779203],
      dtype=float32)

In [9]:
model.wv.most_similar('like')

[('I', 0.42540043592453003),
 ('machine', 0.36355969309806824),
 ('not', 0.311229407787323),
 ('kaggle', -0.004140517208725214),
 ('much', -0.11530755460262299),
 ('do', -0.1529018133878708),
 ('love', -0.25542783737182617),
 ('really', -0.4161785840988159),
 ('learning', -0.44330504536628723),
 ('very', -0.44338396191596985)]
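`most_similar` ranks words by the cosine similarity between their vectors. As a sanity check, the sketch below recomputes the score for 'I' from two vectors copied by hand from this notebook's outputs (the `model.wv['like']` output and the row for 'I' in the `wordvec` array), so it agrees only up to rounding. The function name `cosine_similarity` is made up for this illustration. With a corpus of only three sentences, the scores most likely reflect the random initialization more than any real closeness in meaning.

```python
import math

# Vectors copied from this notebook's outputs (rounded to 8 decimals)
vec_i = [0.08898099, 0.02501909, 0.03683598, 0.07944275, 0.01565849,
         0.05513714, 0.0667302, -0.05495857, -0.08889369, -0.03996675]
vec_like = [0.01650858, 0.01069946, 0.00188946, 0.09910005, 0.06153275,
            0.05853238, 0.04005488, 0.02443584, -0.03179482, 0.09779203]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(round(cosine_similarity(vec_i, vec_like), 4))
```

The result should land close to the 0.4254 that `most_similar` reports for 'I'.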

In [10]:
df['text'][0].split()

['I', 'like', 'kaggle', 'very', 'much']

In [11]:
import numpy as np


wordvec = np.array([model.wv[word] for word in df['text'][0].split()])
wordvec

array([[ 0.08898099,  0.02501909,  0.03683598,  0.07944275,  0.01565849,
         0.05513714,  0.0667302 , -0.05495857, -0.08889369, -0.03996675],
       [ 0.01650858,  0.01069946,  0.00188946,  0.09910005,  0.06153275,
         0.05853238,  0.04005488,  0.02443584, -0.03179482,  0.09779203],
       [ 0.06329302, -0.03939352, -0.03167932, -0.04431488,  0.04389417,
        -0.04902608,  0.09809195, -0.01098474, -0.00437022,  0.00090965],
       [ 0.03720424, -0.02774719,  0.02864924,  0.01963681, -0.07835456,
        -0.08814968,  0.03203132, -0.02247364,  0.01966591, -0.03539274],
       [-0.09157717,  0.04835419, -0.00529734, -0.08170088, -0.05110302,
         0.00822875,  0.04535742,  0.00155444,  0.02258943,  0.07426786]],
      dtype=float32)

> By vectorizing words with this idea, sentences can be used as input for machine learning algorithms. The following are some examples of such methods.

> 1. Take the average of the vectors of words appearing in a sentence
> 2. Take the maximum value of each element of the word vectors appearing in a sentence
> 3. Treat each word as time series data

Ishihara, Shotaro; Murata, Hideki. Kaggle Start Book in Python (KS Information Science Specialized Book) (p.205). Kodansha. Kindle edition.

### 1. Take the average of the vectors of words appearing in a sentence

In [12]:
np.mean(wordvec, axis=0)

array([ 0.02288193,  0.00338641,  0.0060796 ,  0.01443277, -0.00167443,
       -0.0030555 ,  0.05645315, -0.01248533, -0.01656068,  0.01952201],
      dtype=float32)

### 2. Take the maximum value of each element of the word vectors appearing in a sentence

In [13]:
np.max(wordvec, axis=0)

array([0.08898099, 0.04835419, 0.03683598, 0.09910005, 0.06153275,
       0.05853238, 0.09809195, 0.02443584, 0.02258943, 0.09779203],
      dtype=float32)
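
### 3. Treat each word as time series data

The notebook stops at the first two methods. As a rough sketch of the third, the word vectors can be kept in sentence order and padded to a fixed length, so that each sentence becomes a `(max_len, dim)` array that a sequence model such as an RNN could consume. The helper `to_padded_sequence`, the toy vectors, and the choice of zero padding are all assumptions made for this illustration, not taken from the book.

```python
import numpy as np


def to_padded_sequence(tokens, vectors, max_len, dim):
    """Stack word vectors in order and zero-pad (or truncate) to max_len rows.

    `vectors` is any mapping from word to a 1-D array of length `dim`;
    words missing from it are left as zero rows. Hypothetical helper.
    """
    seq = np.zeros((max_len, dim), dtype=np.float32)
    for position, token in enumerate(tokens[:max_len]):
        if token in vectors:
            seq[position] = vectors[token]
    return seq


# Toy 3-dimensional vectors standing in for model.wv
toy_vectors = {
    'I': np.array([0.1, 0.2, 0.3], dtype=np.float32),
    'like': np.array([0.4, 0.5, 0.6], dtype=np.float32),
    'kaggle': np.array([0.7, 0.8, 0.9], dtype=np.float32),
}

seq = to_padded_sequence('I like kaggle'.split(), toy_vectors, max_len=5, dim=3)
print(seq.shape)
print(seq)
```

Unlike the mean and the maximum, this representation keeps word order, at the cost of a larger input and the need to choose `max_len`.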