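**Preface:** This news text classification competition is co-hosted by DataWhale and Alibaba Tianchi and is positioned as an entry-level NLP contest. [A detailed description of the task is available here](https://tianchi.aliyun.com/competition/entrance/531810/information).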

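## Pinned questions
- How do I select the contents of one row of a DataFrame and convert it to a string?
- Is training in batches necessary, and what is the purpose of batching?
- What do sklearn classifiers such as LR require of train_X and train_y?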
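
As a rough answer to the first and third questions, the sketch below uses a toy DataFrame (the two rows and the small `X`/`y` arrays are made up for illustration, not taken from the competition data). It pulls one cell out as a Python string, and it shows the shapes sklearn estimators generally expect: `X` as a 2-D array of shape `(n_samples, n_features)` and `y` as a 1-D array of length `n_samples`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Toy frame mimicking a single tab-joined column (illustrative data only).
df = pd.DataFrame({"label\ttext": ["2\t2967 6758 339", "11\t4464 486 6352"]})

# Select one cell by row position and column name; str() guarantees a Python string.
cell = str(df.iloc[0]["label\ttext"])
label, text = cell.split("\t")
print(label, text.split())

# sklearn estimators expect X as 2-D (n_samples, n_features) and y as 1-D (n_samples,).
X = np.array([[1, 0, 2], [0, 3, 1], [2, 0, 1], [0, 2, 2]])
y = np.array(["2", "11", "2", "11"])  # string labels are accepted as class names
clf = LogisticRegression().fit(X, y)
print(X.shape, y.shape, clf.classes_)
```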

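### Understanding the task
- Objective: text classification (14 classes)
- Data format: each sample consists of text and label; text is anonymised at the character level (each original character is replaced by a number), and label is one of the 14 integers 0~13, denoting 14 categories of text such as finance.
- Evaluation metric: F1 score (higher is better)
- Possible approaches: TF-IDF + LR or other traditional classifiers; word2vec features + RNN + softmax; BERT features + softmax classification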
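
The first approach in the list (TF-IDF + LR) can be sketched end to end with sklearn's `TfidfVectorizer` and `Pipeline`. This is only a minimal illustration on four made-up anonymised texts, not the notebook's actual method further down, which builds the count matrix by hand. The `token_pattern` and `max_iter` values are assumptions chosen for the toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up anonymised texts: tokens are numeric ids separated by spaces.
texts = [
    "2967 6758 339 2021",
    "4464 486 6352 5619",
    "2967 339 339 1854",
    "4464 6352 486 486",
]
labels = [0, 1, 0, 1]

# token_pattern=r"\S+" keeps every whitespace-separated id, including 1-character ones.
model = make_pipeline(
    TfidfVectorizer(token_pattern=r"\S+"),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

pred = model.predict(["339 2967 2021", "486 4464 5619"])
print(pred)
```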

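### task2 Data loading and data analysis (on 0722)
- Inspect the structure and size of the training set
- Get the maximum text length
- Build a vocabulary and get word frequencies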

In [1]:
import pandas as pd

In [2]:
train_path = 'train_set.csv/train_set.csv'
train_set = pd.read_csv(train_path, encoding = 'utf-8')
train_set.head()

Unnamed: 0,label\ttext
0,2\t2967 6758 339 2021 1854 3731 4109 3792 4149...
1,11\t4464 486 6352 5619 2465 4802 1452 3137 577...
2,3\t7346 4068 5074 3747 5681 6093 1777 2226 735...
3,2\t7159 948 4866 2109 5520 2490 211 3956 5520 ...
4,3\t3646 3055 3055 2490 4659 6065 3370 5814 246...


In [3]:
train_set.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Data columns (total 1 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   label	text  200000 non-null  object
dtypes: object(1)
memory usage: 1.5+ MB
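
The `info()` output shows a single column named `label\ttext`, which suggests the file is tab-separated and was parsed with the default comma separator. A possible alternative, sketched here on a two-row in-memory sample (the sample string is made up to mirror the layout above), is to pass `sep='\t'` so that pandas splits the two fields itself:

```python
import io
import pandas as pd

# Two rows in the same tab-separated layout as the header seen above (label<TAB>text).
raw = "label\ttext\n2\t2967 6758 339\n11\t4464 486 6352\n"

# sep="\t" makes pandas split the label and text fields into separate columns.
df = pd.read_csv(io.StringIO(raw), sep="\t")
print(df.columns.tolist())
print(df.dtypes)
```

With the columns separated this way, the labels are parsed as integers and the manual `split('\t')` step would no longer be needed.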


In [4]:
# Check how many samples there are (shape[0] is the row count; shape[1] would give the number of columns)
train_set.shape[0]

200000

In [5]:
# Split each row into train_text and train_label
# (only the first 20000 of the 200000 rows are used; labels stay as strings)

#train_set.loc[0]['label\ttext']
train_text, train_label = [],[]
for i in range(20000):
    label, text = train_set.loc[i]['label\ttext'].split('\t')
    train_label.append(label)
    train_text.append(text)
    
# Split each text into a list of tokens
train_text_lst = []
for i in range(20000):
    train_text_lst.append(train_text[i].split())
len(train_text_lst)

20000

In [6]:
# Get the maximum text length (in tokens)
def get_len_text(text_list):
    # default=0 keeps the original behaviour for an empty list
    return max((len(it) for it in text_list), default=0)
get_len_text(train_text_lst)

44665
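
The maximum alone can be dominated by a few very long texts, so it may help to look at the rest of the length distribution as well. The sketch below does this on a made-up `token_lists` variable standing in for `train_text_lst`. The choice of the 95th percentile is an assumption for illustration.

```python
import numpy as np

# Stand-in for train_text_lst: a list of token lists (illustrative data only).
token_lists = [["1", "2", "3"], ["4"], ["5", "6"], ["7", "8", "9", "10", "11", "12"]]

# One length per text, then a few summary statistics.
lengths = np.array([len(tokens) for tokens in token_lists])
print("max:", lengths.max())
print("mean:", lengths.mean())
print("median:", np.median(lengths))
print("95th percentile:", np.percentile(lengths, 95))
```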

In [7]:
# Collect every word from every text into all_words
all_words = []
for i in train_text_lst:
    for j in i:
        all_words.append(j)
print("There are " + str(len(all_words)) + " words in total.")

There are 18092357 words in total.


In [8]:
# Build the vocabulary; each item maps word -> [id, freq]
from collections import Counter
def build_vocab(all_words):
    cnt = Counter(all_words) # dict-like mapping word -> freq
    res = {}
    for word,freq in cnt.items():
        res[word] = [len(res), freq]
    return res
print("The vocabulary contains " + str(len(build_vocab(all_words))) + " words")

# The vocabulary: voc
voc = build_vocab(all_words)

def id2word(idx,voc):
    # Linear scan over the vocabulary; returns None if idx is not found
    for word,lst in voc.items():
        if lst[0] == idx:
            return word
def word2id(word,voc):
    return voc[word][0]
print(id2word(2,voc))
print(word2id('667',voc))

The vocabulary contains 5697 words
339
2231


In [9]:
word2id('339',voc)

2
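
Since `id2word` scans the whole vocabulary on every call, a reverse dictionary is one way to make the lookup constant-time. The sketch below uses a made-up word list, and the names `word_to_id` and `id_to_word` are hypothetical. It also uses `Counter.most_common` to list the most frequent words.

```python
from collections import Counter

# Illustrative token stream; ids are assigned in order of first appearance,
# which matches how build_vocab assigns them via len(res).
words = ["2967", "6758", "339", "2967", "339", "339"]
cnt = Counter(words)

word_to_id = {word: i for i, word in enumerate(cnt)}
id_to_word = {i: word for word, i in word_to_id.items()}

print(id_to_word[2])        # constant-time reverse lookup
print(cnt.most_common(2))   # most frequent words with their counts
```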

### task3 Machine-learning-based text classification (on 0725)
- Build features (two ways: word_count / TF-IDF)
- Try lr, svm, and xgb for classification using the sklearn library
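
Only LR is actually run in the cells below. As a rough sketch of how the SVM option might look, the following trains sklearn's `LinearSVC` on synthetic count-like data and scores it with macro F1. The data, the `max_iter` value, and the stratified split are all assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Synthetic count-like features with a simple linear decision rule (illustrative only).
rng = np.random.default_rng(0)
X = rng.poisson(3.0, size=(300, 10)).astype(float)
y = (X[:, 0] + X[:, 1] > X[:, 2] + X[:, 3]).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=2018, stratify=y
)

# A linear SVM; max_iter is raised to give the solver more room to converge.
svm = LinearSVC(max_iter=10000).fit(X_tr, y_tr)
score = f1_score(y_te, svm.predict(X_te), average="macro")
print(score)
```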

#### 3.1 encoding by wordcnt

In [10]:
def encoding_by_wordcnt(train_set,voc):
    res = []
    for it in train_set:
        tmp = [0]*len(voc)
        for i in it:
            index = word2id(i,voc)
            tmp[index]+=1
        res.append(tmp)
    return res
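
The function above builds a dense list of lists, which for 20000 texts and a 5697-word vocabulary holds over a hundred million mostly-zero entries. A sparse matrix is one possible alternative. The sketch below uses `scipy.sparse.csr_matrix`, with a hypothetical helper `encode_counts_sparse` and a plain `word -> id` mapping rather than the `voc` structure used in this notebook.

```python
import numpy as np
from scipy.sparse import csr_matrix

def encode_counts_sparse(token_lists, word_to_id):
    """Build a (n_texts, vocab_size) sparse word-count matrix."""
    rows, cols = [], []
    for r, tokens in enumerate(token_lists):
        for tok in tokens:
            rows.append(r)
            cols.append(word_to_id[tok])
    data = np.ones(len(rows), dtype=np.int64)
    # Duplicate (row, col) pairs are summed, which yields the per-word counts.
    return csr_matrix((data, (rows, cols)), shape=(len(token_lists), len(word_to_id)))

# Illustrative inputs.
word_to_id = {"a": 0, "b": 1, "c": 2}
counts = encode_counts_sparse([["a", "c", "c"], ["b"]], word_to_id)
print(counts.toarray())
```

sklearn's `LogisticRegression` and `TfidfTransformer` both accept sparse input, so such a matrix could be passed on without calling `toarray()`.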

In [11]:
# Encode train_set (the result is saved to CSV in the cells below)
train_encoding_by_wordcnt = encoding_by_wordcnt(train_text_lst,voc)

In [12]:
df = pd.DataFrame(train_encoding_by_wordcnt)

In [13]:
df.to_csv('train_set_encoding_by_wordcnt.csv')

In [None]:
def save_encoding(encoding_lst,csv_path):
    df = pd.DataFrame(encoding_lst)
    df.to_csv(csv_path)

In [14]:
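# Load the saved encoding back into a list of rows
def load_encoding(csv_path):
    # index_col=0 skips the index column that to_csv writes by default
    df = pd.read_csv(csv_path, index_col=0)
    # Every row is returned (the earlier version read row 0 on each iteration)
    return df.values.tolist()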

In [15]:
# Build the model and train it
from sklearn.linear_model import LogisticRegression
def build_and_train_encoding_by_wordcnt(train_x,train_y):
    # e.g. train_x, train_y = train_encoding_by_wordcnt, train_label
    lr = LogisticRegression()
    lr.fit(train_x, train_y)
    return lr

In [29]:
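# Evaluation function: macro-averaged F1
from sklearn.metrics import f1_score
def model_metrics(model, test_x, test_y):
    y_pred = model.predict(test_x)
    # The labels in this notebook are strings such as '2', so an integer list like
    # range(14) does not match their type; let sklearn infer the label set instead
    return f1_score(test_y, y_pred, average = 'macro')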

In [19]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
train_x,train_y = train_encoding_by_wordcnt,train_label
X_train, X_test, y_train, y_test = train_test_split(train_x,train_y,test_size=0.3, random_state = 2018)

In [36]:
%%time
lr = build_and_train_encoding_by_wordcnt(X_train,y_train)

Wall time: 28 s


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
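
The warning above means the solver stopped at its iteration limit before converging. As the message suggests, two common remedies are raising `max_iter` and scaling the features. The sketch below combines both on synthetic count-like data. The data, the choice of `MaxAbsScaler`, and `max_iter=1000` are assumptions for illustration, not settings tuned for this competition.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Synthetic count-like features (illustrative only).
rng = np.random.default_rng(0)
X = rng.poisson(lam=3.0, size=(200, 20)).astype(float)
y = (X[:, 0] > X[:, 1]).astype(int)

# MaxAbsScaler rescales each feature to [-1, 1] and also works on sparse input;
# max_iter raises the solver's iteration limit above the default.
clf = make_pipeline(MaxAbsScaler(), LogisticRegression(max_iter=1000))
clf.fit(X, y)

# n_iter_ reports how many iterations the solver actually used.
n_iter = clf.named_steps["logisticregression"].n_iter_
print(n_iter)
```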


In [37]:
model_metrics(lr, X_test, y_test)

  mask &= (ar1 != a)


0.8445580663131584
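
For reference, macro F1 is the unweighted mean of the per-class F1 scores, where each class's F1 is `2*TP / (2*TP + FP + FN)`. The sketch below computes it by hand on a small made-up example and compares the result with `sklearn.metrics.f1_score`. The helper name `per_class_f1` is hypothetical.

```python
from sklearn.metrics import f1_score

# Made-up labels (strings, as in this notebook).
y_true = ["0", "0", "1", "1", "2", "2"]
y_pred = ["0", "1", "1", "1", "2", "0"]

def per_class_f1(label):
    """F1 for one class from its true/false positives and false negatives."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

labels = sorted(set(y_true))
manual_macro = sum(per_class_f1(label) for label in labels) / len(labels)
sk_macro = f1_score(y_true, y_pred, average="macro")
print(manual_macro, sk_macro)
```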

#### 3.2 encoding by TF-IDF

In [31]:
from sklearn.feature_extraction.text import TfidfTransformer 

# Data layout: tfidf[i][j] is the TF-IDF weight of word j in document i
def encoding_by_tfidf(X):
    transformer = TfidfTransformer() 
    # Convert the word-count matrix X into TF-IDF values
    tfidf = transformer.fit_transform(X)
    return tfidf

In [32]:
train_encoding_by_tfidf = encoding_by_tfidf(train_encoding_by_wordcnt)
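
To make the transformation concrete: with its default settings, `TfidfTransformer` computes `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, multiplies each count by its idf, and then L2-normalises every row. The sketch below reproduces that by hand on a small made-up count matrix and checks it against sklearn.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Small illustrative count matrix: rows are documents, columns are words.
counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [3, 0, 0],
    [4, 0, 0],
    [3, 2, 0],
    [3, 0, 2],
])

n = counts.shape[0]                      # number of documents
df = (counts > 0).sum(axis=0)            # document frequency of each word
idf = np.log((1 + n) / (1 + df)) + 1     # smoothed idf (sklearn default)
raw = counts * idf                       # tf * idf
manual = raw / np.linalg.norm(raw, axis=1, keepdims=True)  # L2-normalise rows

sk = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual, sk))
```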

In [35]:
train_x_tfidf,train_y_tfidf = train_encoding_by_tfidf.toarray(),train_label
X_train_tfidf, X_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(train_x_tfidf,train_y_tfidf,test_size=0.3, random_state = 2018)

In [39]:
lr_tfidf = build_and_train_encoding_by_wordcnt(X_train_tfidf,y_train_tfidf)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


In [40]:
model_metrics(lr_tfidf, X_test_tfidf, y_test_tfidf)

  mask &= (ar1 != a)


0.8622178600615233