https://medium.com/@ageitgey/text-classification-is-your-new-secret-weapon-7ca4fad15788

## Extracting meaning with a classification model

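The data consists of Yelp reviews rated on a five-star scale.

To train our text classification model, we collected reviews of similar places.

After training the model, we used it to make predictions on new text.

How is this possible?

Why is it advantageous at test time to treat this as a classification problem rather than a language understanding problem?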

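1. People keep inventing and evolving language. A classifier does not care whether the text is written in standard language or in emoji.

2. Website users do not write only in the particular language you expect.

3. It is fast. Classification algorithms are very simple compared with machine learning models such as RNNs.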
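
To illustrate the first point, here is a toy sketch (not fastText, and not the method used in the rest of these notes) of a classifier that treats every whitespace-separated token as an opaque feature. The training examples and the `train` and `classify` helpers are hypothetical, made up for illustration. Emoji and slang need no special handling: they are counted like any other token.

```python
from collections import Counter, defaultdict

# Hypothetical toy training set: (label, text) pairs invented for illustration.
TRAINING_EXAMPLES = [
    ("positive", "the food was great 😍"),
    ("positive", "great service , would def come back"),
    ("negative", "the food was terrible 🤮"),
    ("negative", "terrible service , never again smh"),
]

def train(examples):
    """Count how often each token appears under each label."""
    counts = defaultdict(Counter)
    for label, text in examples:
        counts[label].update(text.split())
    return counts

def classify(counts, text):
    """Pick the label whose training tokens overlap most with the input."""
    tokens = text.split()
    scores = {
        label: sum(token_counts[token] for token in tokens)
        for label, token_counts in counts.items()
    }
    return max(scores, key=scores.get)

toy_model = train(TRAINING_EXAMPLES)
print(classify(toy_model, "def great 😍"))
print(classify(toy_model, "smh 🤮"))
```

The model never needs to know what "smh" or an emoji means; it only needs to have seen them next to a label.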

## What is fastText?

fastText is an open-source library from Facebook.

## Step 1: Download the training data

{
  "review_id": "abc123",
  "user_id": "xyy123",
  "business_id": "1234",
  "stars": 5,
  "date": "2015-01-01",
  "text": "This restaurant is great!",
  "useful": 0,
  "funny": 0,
  "cool": 0
}

## Step 2: Split into training and test sets and save

import json
from pathlib import Path
import re
import random

reviews_data = Path("yelp-dataset") / "yelp_academic_dataset_review.json"
training_data = Path("fasttext_dataset_training.txt")
test_data = Path("fasttext_dataset_test.txt")


# What percent of data to save separately as test data
percent_test_data = 0.10

def strip_formatting(string):
    string = string.lower()
    string = re.sub(r"([.!?,'/()])", r" \1 ", string)
    return string

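# Open files as UTF-8 explicitly (reviews contain non-ASCII text) and
# avoid shadowing the built-in input() with the file handle name.
with reviews_data.open(encoding="utf-8") as reviews_file, \
     training_data.open("w", encoding="utf-8") as train_output, \
     test_data.open("w", encoding="utf-8") as test_output:

    for line in reviews_file: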
        review_data = json.loads(line)

        rating = review_data['stars']
        text = review_data['text'].replace("\n", " ")
        text = strip_formatting(text)

        fasttext_line = "__label__{} {}".format(rating, text)

        if random.random() <= percent_test_data:
            test_output.write(fasttext_line + "\n")
        else:
            train_output.write(fasttext_line + "\n")
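
To make the output format concrete, this sketch runs the same preprocessing on records held in memory instead of the full dataset. The sample records are illustrative placeholders, not real data, and `to_fasttext_line` is a hypothetical helper name. Note that if a dataset stores `stars` as a float such as `5.0`, the label comes out as `__label__5.0` rather than `__label__5`.

```python
import json
import re

def strip_formatting(string):
    # Lowercase, then pad punctuation with spaces so it becomes its own token.
    string = string.lower()
    string = re.sub(r"([.!?,'/()])", r" \1 ", string)
    return string

def to_fasttext_line(json_line):
    """Turn one JSON review record into a '__label__<stars> <text>' line."""
    review = json.loads(json_line)
    text = strip_formatting(review["text"].replace("\n", " "))
    return "__label__{} {}".format(review["stars"], text)

# Placeholder records for illustration only.
int_stars = '{"stars": 5, "text": "This restaurant is great!"}'
float_stars = '{"stars": 5.0, "text": "Ok."}'

print(repr(to_fasttext_line(int_stars)))
print(repr(to_fasttext_line(float_stars)))
```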

## Step 3: Train the model

```cd fastText```

```./fasttext supervised -input ../fasttext_dataset_training.txt -output ../reviews_model_ngrams -wordNgrams 2```
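
The `-wordNgrams 2` option makes fastText use pairs of adjacent words (bigrams) as features alongside single words, which lets it tell "not good" apart from "good". The sketch below only illustrates what a word bigram is. The `word_ngrams` helper is hypothetical and does not reproduce how fastText builds these features internally.

```python
def word_ngrams(text, n):
    """Return every run of n adjacent whitespace-separated tokens in text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample = "the food was not good"
print(word_ngrams(sample, 1))  # single words
print(word_ngrams(sample, 2))  # adjacent word pairs
```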



Use the trained model to classify new reviews:

import fasttext
import re

def strip_formatting(string):
    string = string.lower()
    string = re.sub(r"([.!?,'/()])", r" \1 ", string)
    return string

# Reviews to check
reviews = [
    "This restaurant literally changed my life. This is the best food I've ever eaten!",
    "I hate this place so much. They were mean to me.",
    "I don't know. It was ok, I guess. Not really sure what to say."
]

# Pre-process the text of each review so it matches the training format
preprocessed_reviews = list(map(strip_formatting, reviews))

# Load the model
classifier = fasttext.load_model('reviews_model_ngrams.bin')

# Get fastText to classify each review with the model
labels, probabilities = classifier.predict(preprocessed_reviews, 1)

# Print the results
for review, label, probability in zip(reviews, labels, probabilities):
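    # Parse the rating from the label text; works for "__label__5" and "__label__5.0"
    stars = int(float(label[0].replace("__label__", "")))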
    print("{} ({}% confidence)".format("☆" * stars, int(probability[0] * 100)))
    print(review)
    print()
    
    

☆☆☆☆☆ (100% confidence)
This restaurant literally changed my life. This is the best food I've ever eaten!

☆ (97% confidence)
I hate this place so much. They were mean to me.

☆☆☆ (84% confidence)
I don't know. It was ok, I guess. Not really sure what to say.
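
The loop above indexes `label[0]` and `probability[0]` because each review gets its own sequence of labels and probabilities, and only the top prediction was requested. The sketch below shows one way to wrap the label parsing in a helper that rejects unexpected input. The helper names are hypothetical, and the label and probability values are hand-written stand-ins shaped like the loop expects, not real model output.

```python
LABEL_PREFIX = "__label__"

def stars_from_label(label):
    """Convert a label such as '__label__5' or '__label__5.0' to an int."""
    if not label.startswith(LABEL_PREFIX):
        raise ValueError("unexpected label: {!r}".format(label))
    return int(float(label[len(LABEL_PREFIX):]))

def format_prediction(label, probability):
    """Render one prediction the same way the loop above prints it."""
    stars = stars_from_label(label)
    return "{} ({}% confidence)".format("☆" * stars, int(probability * 100))

# Stand-in values for illustration only: one inner sequence per review.
labels = [["__label__5.0"], ["__label__1"]]
probabilities = [[0.75], [0.5]]

for review_labels, review_probabilities in zip(labels, probabilities):
    print(format_prediction(review_labels[0], review_probabilities[0]))
```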




