## Reviewing the news dataset
* About 19,000 newsgroup messages covering 20 different topics such as politics, religion, sports, and science

In [42]:
from sklearn.datasets import fetch_20newsgroups

In [50]:
categories = ['alt.atheism', 'soc.religion.christian','comp.graphics', 'sci.med']
news = fetch_20newsgroups(subset='all', categories=categories)

In [45]:
data = news.data
target = news.target
target_names = news.target_names

In [46]:
print( "=== TARGET NAMES ===")
print( target_names )

=== TARGET NAMES ===
['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']


In [47]:
print( "=== Data ===")
print( data[0] )
print( "=== Target ===")
print( target[0] )
print( "=== Target Name ===")
print( target_names[target[0]] )

=== Data ===
From: geb@cs.pitt.edu (Gordon Banks)
Subject: Re: "CAN'T BREATHE"
Article-I.D.: pitt.19440
Reply-To: geb@cs.pitt.edu (Gordon Banks)
Organization: Univ. of Pittsburgh Computer Science
Lines: 23

In article <1993Mar29.204003.26952@tijc02.uucp> pjs269@tijc02.uucp (Paul Schmidt) writes:
>I think it is important to verify all procedures with proper studies to
>show their worthiness and risk.  I just read an interesting tidbit that 
>80% of the medical treatments are unproven and not based on scientific 
>fact.  For example, many treatments of prostate cancer are unproven and
>the treatment may be more dangerous than the disease (according to the
>article I read.)

Where did you read this?  I don't think this is true.  I think most
medical treatments are based on science, although it is difficult
to prove anything with certitude.  It is true that there are some
things that have just been found "to work", but we have no good
explanation for why.  But almost everything does have a

# Document preprocessing
* Machine learning models operate on numeric data
* A document therefore also has to go through a step that extracts a numeric feature vector from it

In [49]:
from sklearn.feature_extraction.text import CountVectorizer
corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
    'The last document?',    
]
vect = CountVectorizer()
corpus_feature = vect.fit_transform(corpus).toarray()

print(vect.vocabulary_)
#['and' 'document' 'first' 'is' 'last' 'one' 'second' 'the' 'third' 'this' ]
print(corpus_feature[1])

{'this': 9, 'is': 3, 'the': 7, 'first': 2, 'document': 1, 'second': 6, 'and': 0, 'third': 8, 'one': 5, 'last': 4}
[0 1 0 1 0 0 2 1 0 1]
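
To make the output above less of a black box, here is a minimal pure-Python sketch of the same bag-of-words idea. It assumes lowercasing and tokens of two or more word characters, which approximates the `CountVectorizer` defaults; the helper names (`tokenize`, `build_vocabulary`, `count_vector`) are illustrative and not part of scikit-learn.

```python
import re
from collections import Counter

# Assumption: lowercase the text and keep tokens of 2+ word characters,
# which approximates CountVectorizer's default behaviour.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_PATTERN.findall(text.lower())

def build_vocabulary(docs):
    # Sort the distinct terms and give each one a column index.
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}

def count_vector(doc, vocabulary):
    # One slot per vocabulary term, filled with that term's count in the document.
    vec = [0] * len(vocabulary)
    for term, n in Counter(tokenize(doc)).items():
        if term in vocabulary:
            vec[vocabulary[term]] = n
    return vec

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
    'The last document?',
]
vocabulary = build_vocabulary(corpus)
second_doc_vector = count_vector(corpus[1], vocabulary)
print(vocabulary)
print(second_doc_vector)
```

The word `second` appears twice in the second document, which is why its column (index 6) holds a 2.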


In [51]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X = count_vect.fit_transform(data).toarray()

print(np.shape(X))

(3759, 47319)


In [52]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, target, test_size=0.2)

print(len(X_train), len(X_test))
print(len(y_train), len(y_test))

3007 752
3007 752
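
The 3007/752 split printed above follows from `test_size=0.2` applied to 3759 samples. A rough sketch of a shuffled hold-out split over indices is shown below; `simple_train_test_split` is a hypothetical helper, and rounding the test size up is an assumption chosen because it reproduces the counts above.

```python
import math
import random

def simple_train_test_split(n_samples, test_size=0.2, seed=0):
    """Return (train_indices, test_indices) after shuffling."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Assumption: the test portion is rounded up, the rest goes to training.
    n_test = math.ceil(test_size * n_samples)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = simple_train_test_split(3759)
print(len(train_idx), len(test_idx))
```

Fixing the seed makes the split repeatable; the notebook cell above does not pass `random_state`, so its split changes from run to run.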


In [53]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(X_train, y_train)
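
For intuition about what the classifier fitted above learns, the following is a small pure-Python sketch of multinomial naive Bayes with additive (Laplace) smoothing. It is a teaching sketch on made-up count vectors, not scikit-learn's implementation; the function names and the toy data are assumptions.

```python
import math

def train_multinomial_nb(X, y, alpha=1.0):
    """Estimate class log-priors and smoothed per-class term log-likelihoods."""
    n_features = len(X[0])
    classes = sorted(set(y))
    class_counts = {c: 0 for c in classes}
    feature_counts = {c: [0] * n_features for c in classes}
    for row, label in zip(X, y):
        class_counts[label] += 1
        for j, count in enumerate(row):
            feature_counts[label][j] += count
    log_prior = {c: math.log(class_counts[c] / len(y)) for c in classes}
    log_likelihood = {}
    for c in classes:
        # Additive smoothing keeps unseen terms from getting zero probability.
        total = sum(feature_counts[c]) + alpha * n_features
        log_likelihood[c] = [math.log((fc + alpha) / total) for fc in feature_counts[c]]
    return classes, log_prior, log_likelihood

def predict(model, row):
    """Pick the class with the highest log-prior plus count-weighted log-likelihood."""
    classes, log_prior, log_likelihood = model
    scores = {
        c: log_prior[c] + sum(n * ll for n, ll in zip(row, log_likelihood[c]))
        for c in classes
    }
    return max(classes, key=scores.get)

# Toy count vectors over a hypothetical vocabulary: ['ball', 'goal', 'prayer', 'church']
X_toy = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 1, 1, 2]]
y_toy = ['sports', 'sports', 'religion', 'religion']
model = train_multinomial_nb(X_toy, y_toy)
print(predict(model, [1, 1, 0, 0]))
print(predict(model, [0, 0, 1, 1]))
```

Working in log space avoids numeric underflow when a document contains many terms.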

In [54]:
y_pred = clf.predict(X_test)

from sklearn import metrics
print(metrics.classification_report(y_test, y_pred, target_names=target_names))

                        precision    recall  f1-score   support

           alt.atheism       0.99      0.98      0.98       146
         comp.graphics       0.96      0.98      0.97       192
               sci.med       0.99      0.97      0.98       212
soc.religion.christian       0.98      0.99      0.98       202

           avg / total       0.98      0.98      0.98       752
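
The columns of the report above can be recomputed by hand. The sketch below derives per-class precision, recall, F1, and support from two label lists; `per_class_report` is a hypothetical helper and the labels are made up for illustration.

```python
def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for every label seen."""
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Define the ratio as 0.0 when its denominator is zero.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = (precision, recall, f1, tp + fn)
    return report

y_true_toy = [0, 0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 1, 1, 1, 0]
report = per_class_report(y_true_toy, y_pred_toy)
for label, (precision, recall, f1, support) in report.items():
    print(label, round(precision, 2), round(recall, 2), round(f1, 2), support)
```

Class 2 is never predicted in this toy example, so all three of its scores come out as 0.0.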



In [56]:
from sklearn.model_selection import cross_val_score
# note: sklearn.cross_validation was removed in scikit-learn 0.20; cross_val_score now lives in sklearn.model_selection

accuracies = cross_val_score(MultinomialNB(), X, target, cv=10, scoring='accuracy')
print(accuracies.mean())

0.974458748432
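
As a rough picture of what 10-fold cross-validation does with these 3759 samples, the sketch below builds plain contiguous folds and averages a per-fold score. This is a simplification: for a classifier and an integer `cv`, `cross_val_score` uses stratified folds, which this sketch does not reproduce. The helper names and the majority-class scorer are assumptions for illustration.

```python
from collections import Counter

def kfold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

def cross_val_mean(score_fold, n_samples, k):
    """Average a per-fold scoring function over all k folds."""
    scores = [score_fold(train, test) for train, test in kfold_indices(n_samples, k)]
    return sum(scores) / len(scores)

# Fold sizes for the dataset size used above.
fold_sizes = [len(test) for _, test in kfold_indices(3759, 10)]
print(fold_sizes)

# Toy scorer: predict the majority training label, report accuracy on the test fold.
labels = [0] * 7 + [1] * 3

def majority_accuracy(train, test):
    majority = Counter(labels[i] for i in train).most_common(1)[0][0]
    return sum(1 for i in test if labels[i] == majority) / len(test)

mean_accuracy = cross_val_mean(majority_accuracy, len(labels), 5)
print(mean_accuracy)
```

In the toy run the last fold contains only the minority class, which is the kind of imbalance stratified folds are meant to avoid.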


In [3]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20


lines=[]
target_names=[]
target=[]
length=0

def unique(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]

def bigrams(input_list):
    return [list(a) for a in zip(input_list, input_list[1:])]

regex=['NNG','NNP','NNB','NP','NR','VV','VX','VCP','VCN','MAG','MAJ']

with open("C:/news.txt") as f:
    for line in f:
        if(line.startswith(';')):
            target_names.append(line[1:-1])
        lines.append(line.replace(" ","").replace("\t","").split('+'))
        length+=1
        
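enc = LabelEncoder()
label_encoder = enc.fit(target_names)
target = label_encoder.transform(target_names)

# Class names in the (sorted) order LabelEncoder uses for the integer labels,
# so classification_report pairs each name with the right label.
target_name = list(label_encoder.classes_)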
        
resultlist = []   

for j in range(length):        
    for i in range(len(lines[j])):
          for s in range(len(regex)):        
                if regex[s] in lines[j][i]:     
                    resultlist.append(lines[j][i])
                    break
    if j >= 1 and " ".join(lines[j]).startswith(';'):  # this line starts with ';' (category header of the next document)
        resultlist.append('\\c')   # insert a document delimiter

data = " ".join(resultlist).split('\\c')   # split on the delimiter
data2=list(bigrams(resultlist))

for i in range(len(data2)):
    (data2[i])=("/".join(data2[i]))
    
data2 = " ".join(data2).split('\\c')  # split on the delimiter

for i in range(1,400):    # each document boundary yields two delimiters, so an empty entry lands between documents
    data2[i]=data2[i*2]   # keep every second entry to drop those empty ones

data2=data2[0:400]

"""
Printing the data
Task 1:
print (data)

Task 2:
print (data2)
"""

#CountVectorizer 
count_vect = CountVectorizer()
X1= count_vect.fit_transform(data).toarray()
count_vect = CountVectorizer()
X2 = count_vect.fit_transform(data2).toarray()

#split
X_train, X_test, y_train, y_test = train_test_split(X1, target, test_size=0.2)
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, target, test_size=0.2)

#Model Learning
clf = MultinomialNB().fit(X_train, y_train)
clf2 = MultinomialNB().fit(X2_train, y2_train)

# Evaluation (task 3)
y_pred = clf.predict(X_test)
print(metrics.classification_report(y_test, y_pred, target_names=target_name))

y2_pred = clf2.predict(X2_test)
print(metrics.classification_report(y2_test, y2_pred, target_names=target_name))

accuracies = cross_val_score(MultinomialNB(), X1, target, cv=5, scoring='accuracy')
print(accuracies.mean())
accuracies2 = cross_val_score(MultinomialNB(), X2, target, cv=5, scoring='accuracy')
print(accuracies2.mean())




             precision    recall  f1-score   support

     건강과 의학       0.00      0.00      0.00         3
         경제       0.46      0.96      0.62        25
         과학       0.00      0.00      0.00         2
         교육       1.00      0.33      0.50         3
     문화와 종교       0.83      0.42      0.56        12
         사회       0.73      0.85      0.79        13
         산업       0.33      0.09      0.14        22

avg / total       0.52      0.54      0.46        80

             precision    recall  f1-score   support

     건강과 의학       0.00      0.00      0.00         1
         경제       0.58      0.72      0.64        25
         과학       0.00      0.00      0.00         2
         교육       1.00      0.20      0.33         5
     문화와 종교       0.80      0.75      0.77        16
         사회       0.64      0.64      0.64        14
         산업       0.37      0.41      0.39        17

avg / total       0.59      0.59      0.57        80

0.482042055712
0.520236644079
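
The two feature sets compared above come from the tag filter and the `bigrams` helper in the last code cell. The sketch below runs both steps on a single line to show what each produces. The input line is hypothetical, since the actual format of `news.txt` is not shown here, and `keep_tagged` is an illustrative name for the filtering loop.

```python
# Tags to keep, copied from the cell above.
KEEP_TAGS = ['NNG', 'NNP', 'NNB', 'NP', 'NR', 'VV', 'VX', 'VCP', 'VCN', 'MAG', 'MAJ']

def keep_tagged(line):
    """Split on '+' and keep tokens that contain any tag (substring match, as in the cell above)."""
    tokens = line.replace(" ", "").replace("\t", "").split('+')
    return [tok for tok in tokens if any(tag in tok for tag in KEEP_TAGS)]

def bigrams(tokens):
    """Pair each token with its successor."""
    return [list(pair) for pair in zip(tokens, tokens[1:])]

# Hypothetical morpheme/TAG line, assumed for illustration only.
line = "경제/NNG + 가/JKS + 빠르/VA + 게/EC + 성장/NNG + 하/VV"
kept = keep_tagged(line)
bigram_features = ["/".join(pair) for pair in bigrams(kept)]
print(kept)
print(bigram_features)
```

Because the match is a plain substring test, a tag such as `NP` also matches tokens tagged `NNP`; an exact comparison on the part after the slash would be stricter.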
