# NLP
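### What is natural language processing
- Natural language processing (NLP) is a branch of computer science, artificial intelligence, and linguistics that aims to enable computers to understand, interpret, and generate human language.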
![image-2.png](attachment:image-2.png)
### History
![image-11.png](attachment:image-11.png)

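- 1950s
    - 1950: Alan Turing proposed the Turing test, a criterion for judging whether a machine exhibits intelligence, which laid groundwork for later NLP research.
- 1970s
    - 1970s: Systems built on grammars and hand-written rules could handle more complex linguistic phenomena, but their generality and robustness were limited.

- 1990s
    - 1990s: With the rise of the internet, the amount of data available for NLP research grew sharply, and statistical methods became dominant.
    - Early to mid-1990s: Support vector machines (SVMs) took their modern form (kernel SVMs in 1992, the soft-margin formulation in 1995). This strong classification algorithm was later widely applied to NLP tasks.
- 2000s
    - Around 2000: Web search engines such as Google (launched in 1998) brought link analysis and term-weighting ideas such as term frequency-inverse document frequency (TF-IDF) to a mass audience.
    - 2006: Deep learning was reintroduced. As computing power grew and large datasets appeared, deep neural networks began to play an increasingly important role in NLP.
- 2010s
    - 2010s: Advances in deep learning, such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs), brought significant progress to NLP.
    - 2013: Word2Vec was introduced, representing words as continuous vectors, which worked well across many NLP tasks.
    - 2014: Sequence-to-sequence (Seq2Seq) models were introduced, offering a new approach to machine translation and other sequence generation tasks.
    - 2018: BERT (Bidirectional Encoder Representations from Transformers) was introduced. It uses a pre-trained Transformer to build deep language representations and greatly advanced the field.
- 2020s
    - 2020s: NLP continues to develop quickly. Pre-trained models such as GPT-3, with 175 billion parameters, can generate high-quality text without task-specific supervision.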

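- The goal of NLP is to let computers process and analyze natural language data so that they can better understand and respond to human language.
- NLP has two subfields, NLG and NLU: natural language generation and natural language understanding.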
![image-4.png](attachment:image-4.png)
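### Applications
- Sentiment analysis
- Chatbots
- Speech recognition
- Machine translation
### How it works
- Tokenization: splitting the input sequence into words, phrases, sentences, and so on (called tokens).
- Stemming: removing or replacing suffixes to obtain a root form called the stem.
- Lemmatization: using vocabulary and morphological analysis to return the dictionary form of a word, called the lemma.
- Named entity recognition: detecting named entities such as person names, company names, quantities, or locations.
- Chunking: grouping individual tokens into larger pieces (called chunks).
- Stop-word removal: dropping very common words that carry little meaning.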

### Deep learning techniques for NLP
![image-10.png](attachment:image-10.png)

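### NLP frameworks
- NLTK: one of the earliest NLP libraries written in Python. It offers easy-to-use interfaces and a suite of text-processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
- spaCy: one of the most versatile open-source NLP libraries. It supports more than 66 languages, ships pre-trained word vectors, and implements many popular models.
- TensorFlow, PyTorch: make it easier to build models thanks to features such as automatic differentiation. These libraries are the most commonly used tools for developing NLP models.
- Hugging Face: provides open-source implementations and weights for more than 135 state-of-the-art models. The repository makes it easy to customize and train models.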

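## Related tasks
### Corpus
- A corpus is a large collection of text used for research. It usually contains a large amount of text data, such as books, articles, web pages, and conversation transcripts, covering a wide range of topics, styles, and usage scenarios.
- Corpora are mainly used for language model training, part-of-speech tagging, and syntactic analysis.
### Lexicon
- A lexicon is a vocabulary list or dictionary: a list of the words or terms of a particular language. Different industries have their own lexicons.
### Token
- A token is a basic unit separated out of a text, usually a word, punctuation mark, number, or any other meaningful character sequence. Tokenization is the process of splitting text into these units. For example, "Hello, how are you?" can be tokenized into: Hello  ,  how  are  you  ?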
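
The tokenization example above can be reproduced with a small regular expression. This is only a sketch: `simple_tokenize` and its pattern are hypothetical names chosen for illustration, and real tokenizers (such as the NLTK ones used later in this notebook) handle many more cases.

```python
import re

# Hypothetical pattern: a word (optionally with internal apostrophes)
# or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")


def simple_tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


print(simple_tokenize("Hello, how are you?"))
# ['Hello', ',', 'how', 'are', 'you', '?']
```

Punctuation is kept as separate tokens so that later steps can decide whether to drop it.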
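### Stop words
- Stop words are words that occur very frequently in text but contribute little to its meaning, for example: the, is, are, in, on, at, and, of, to, for.
- We can also define custom stop words.
- Stop words are removed in many tasks to improve computational efficiency.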
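
A minimal sketch of stop-word filtering with a custom extension follows. The tiny `STOP_WORDS` set and the `remove_stop_words` helper are assumptions made for illustration; real stop lists are much longer.

```python
# Tiny illustrative stop list (an assumption, not a real library list).
STOP_WORDS = {"the", "is", "are", "in", "on", "at", "and", "of", "to", "for", "a", "this"}


def remove_stop_words(tokens, extra_stop_words=()):
    """Drop tokens whose lowercase form is a built-in or custom stop word."""
    stop = STOP_WORDS | {w.lower() for w in extra_stop_words}
    return [t for t in tokens if t.lower() not in stop]


tokens = "This is a sample sentence showing off the stop words filtration".split()
print(remove_stop_words(tokens))
# ['sample', 'sentence', 'showing', 'off', 'stop', 'words', 'filtration']
print(remove_stop_words(tokens, extra_stop_words={"sample"}))
# ['sentence', 'showing', 'off', 'stop', 'words', 'filtration']
```

Comparing lowercase forms means that "This" and "this" are treated alike, which a case-sensitive lookup would not do.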
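### Stemming
- Stemming usually happens during text preprocessing. It strips inflectional endings from words, such as tense markers (ing, ed), plural markers (s, es), and ly.
- Its purpose is to reduce vocabulary variety and so improve computation and retrieval performance.
- In some cases it loses part of a word's meaning.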
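
To make the idea concrete, here is a deliberately naive suffix stripper. It is a sketch only: `naive_stem`, the suffix list, and the minimum stem length are assumptions for illustration, and it is not the Porter algorithm used later in this notebook.

```python
SUFFIXES = ("ing", "ed", "ly", "es", "s")  # checked in this order


def naive_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    lower = word.lower()
    for suffix in SUFFIXES:
        if lower.endswith(suffix) and len(lower) - len(suffix) >= min_stem_len:
            return lower[: -len(suffix)]
    return lower


for w in ["walking", "walked", "quickly", "boxes", "cats", "news"]:
    print(w, "->", naive_stem(w))
```

The last example shows the information loss mentioned above: "news" is reduced to "new", which is a different word.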
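### Part-of-speech (POS) tagging
- POS tagging assigns a part of speech (noun, verb, adjective, adverb, and so on) to each word in a text.
- The accuracy of POS tagging directly affects the quality of downstream processing.
- For example, the English sentence "The cat sits on the mat" can be tagged as:

    The → determiner
    cat → noun
    sits → verb
    on → preposition
    the → determiner
    mat → noun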

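### Chunking
- Chunking groups the words of a text into larger pieces, usually grammatically meaningful phrases or clauses. It sits between POS tagging and full syntactic parsing.
- Its purpose is to combine words into meaningful grammatical units such as noun phrases, verb phrases, and prepositional phrases. This turns raw text into a more structured form that is easier to analyze semantically.
- Chunking is usually implemented with a set of rules over sequences of POS tags. For example, a noun phrase can consist of one or more adjectives followed by a noun.
- For example, "The white cat sits on the table" is chunked as [The white cat] (noun phrase) sits [on the table] (prepositional phrase).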
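
The rule "optional determiner, any adjectives, then one or more nouns" can be sketched in plain Python over hand-tagged tokens. The function name and the hand-written tags are assumptions for illustration, and this sketch only finds noun phrases, not the prepositional phrase from the example.

```python
def chunk_noun_phrases(tagged):
    """Group DT? JJ* NN+ runs into ('NP', [words]); other tokens pass through."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":                           # optional determiner
            j += 1
        while j < n and tagged[j][1].startswith("JJ"):     # any adjectives
            j += 1
        k = j
        while k < n and tagged[k][1].startswith("NN"):     # one or more nouns
            k += 1
        if k > j:   # at least one noun was found: emit a noun phrase
            chunks.append(("NP", [word for word, _ in tagged[i:k]]))
            i = k
        else:       # no noun phrase starts here: keep the single token
            chunks.append(tagged[i])
            i += 1
    return chunks


tagged = [("The", "DT"), ("white", "JJ"), ("cat", "NN"), ("sits", "VBZ"),
          ("on", "IN"), ("the", "DT"), ("table", "NN")]
print(chunk_noun_phrases(tagged))
# [('NP', ['The', 'white', 'cat']), ('sits', 'VBZ'), ('on', 'IN'), ('NP', ['the', 'table'])]
```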
 
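### Named entity recognition (NER)
- NER identifies entities with a specific meaning in a text and classifies them into predefined categories such as person, location, organization, time, and monetary amount.
- Identifying the named entities in a text gives a deeper understanding of its content and context.
- For example: "Greg went to Southeast University and XunWu lake last week"

            Greg --> person
            Southeast University --> organization
            XunWu lake --> location
- NER can be implemented with rule-based methods or with machine learning methods such as conditional random fields (CRFs) and deep learning models (for example bidirectional LSTMs or BERT). The learned methods usually need a large amount of annotated data so that the model can learn to recognize and classify entities.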
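
The simplest rule-based approach is a gazetteer, a lookup table of known names. The sketch below is an assumption-laden illustration: the `GAZETTEER` entries, labels, and `find_entities` helper are all hypothetical, and a real system would also need to handle names it has never seen.

```python
# Hypothetical gazetteer; names and labels are chosen only for this example.
GAZETTEER = {
    "Greg": "PERSON",
    "Southeast University": "ORGANIZATION",
    "XunWu lake": "LOCATION",
}


def find_entities(text):
    """Return (name, label, start) triples in text order, preferring longer names."""
    found = []
    taken = set()  # character positions already covered by a match
    for name in sorted(GAZETTEER, key=len, reverse=True):
        start = text.find(name)
        while start != -1:
            span = range(start, start + len(name))
            if not taken.intersection(span):
                found.append((name, GAZETTEER[name], start))
                taken.update(span)
            start = text.find(name, start + 1)
    return sorted(found, key=lambda item: item[2])


text = "Greg went to Southeast University and XunWu lake last week"
for name, label, _ in find_entities(text):
    print(name, "-->", label)
```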
    
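### Lemmatization
- Lemmatization reduces an inflected form of a word to its base form.
- For example, "running" is reduced to "run".
### Lemmatization vs. stemming
- Both are techniques for reducing a word to a base form.

| | Lemmatization | Stemming |
|---|---|---|
| Goal | Reduce a word to its dictionary form | Simplify a word by removing suffixes such as -s, -ed, -ing |
| Method | Usually involves morphological analysis, including part of speech and inflection rules | Applies a fixed set of rules to cut off suffixes |
| Strengths | Usually more accurate and preserves the meaning of the word | Fast and simple to implement, with no need for large language resources |
| Weaknesses | More complex and more expensive to compute; needs a dictionary and POS information | May handle some words incorrectly and lose meaning |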
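
The difference shows up clearly on irregular forms. In this sketch, `LEMMA_DICT`, `lemmatize`, and `strip_suffix` are hypothetical stand-ins: the dictionary plays the role of a real lexical resource, and the suffix rule plays the role of a real stemmer.

```python
# Hypothetical miniature dictionary keyed by (word, part of speech).
LEMMA_DICT = {
    ("running", "v"): "run",
    ("better", "a"): "good",
    ("mice", "n"): "mouse",
    ("studies", "n"): "study",
}


def lemmatize(word, pos):
    """Dictionary lookup; unknown words are returned unchanged."""
    return LEMMA_DICT.get((word.lower(), pos), word.lower())


def strip_suffix(word):
    """Rule-based stemming: cut one common suffix, with no dictionary involved."""
    lower = word.lower()
    for suffix in ("ing", "ies", "es", "ed", "s"):
        if lower.endswith(suffix) and len(lower) > len(suffix) + 2:
            return lower[: -len(suffix)]
    return lower


for word, pos in [("running", "v"), ("better", "a"), ("mice", "n"), ("studies", "n")]:
    print(f"{word:8} stem={strip_suffix(word):8} lemma={lemmatize(word, pos)}")
```

The stemmer yields non-words such as "runn" and "stud" and cannot relate "better" to "good", while the lemmatizer can, at the cost of needing a dictionary and the part of speech.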
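### Text feature extraction

- Why extract features?
    - Machine learning models generally expect fixed-length inputs and outputs, while text is unstructured. We therefore need to turn text into numbers, usually numeric vectors.
    - Feature extraction simplifies the data and makes model training more efficient.
    - It improves interpretability: with explicit features it is easier to understand how the model reaches its decisions.
- Bag of words
    - Bag-of-words counts how many times each word occurs in a document.
    - It ignores the context and order of words and only considers how often each word occurs.
   ![image-6.png](attachment:image-6.png)
    - The bag-of-words model is efficient for text classification and text similarity, where similarity is computed as a distance between vectors.
    - It needs a large vocabulary.
    - Its text similarity estimates are not very accurate.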


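- TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure of how important a word is to one document within a collection of documents. It combines two quantities:
    - TF: term frequency
        - Is the word important to the current document? TF = (number of times term t appears in the document) / (total number of words in the document)

        ![image-7.png](attachment:image-7.png)

    - IDF: inverse document frequency
        - How distinctive is the word across the whole collection? IDF = log(total number of documents / number of documents containing the word)
        ![image-8.png](attachment:image-8.png)

    - TF-IDF = TF * IDF
    ![image-9.png](attachment:image-9.png)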
    
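- Word vectors
    - Word vectors map words to dense vectors in a vector space with many dimensions. These vectors capture semantic and contextual information, so similarity and relatedness between words can be computed mathematically.
    - Word vectors focus on meaning and context, so they suit tasks that need to understand what words mean and how they relate.
    - Word vectors are usually pre-trained on large amounts of text, for example with Word2Vec.

- Summary
    - Capturing semantics: word vectors capture the meaning of words, so semantically similar words lie close together in the vector space.
    - Context dependence: static word vectors such as Word2Vec learn one vector per word from the contexts it appears in. Contextual models such as BERT go further and give the same word different vectors in different contexts.
    - Word vectors support vector arithmetic, so relationships between words can be explored by adding and subtracting vectors.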
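
The count-based features described in this section (bag of words and TF-IDF) can be computed by hand on a toy corpus. This sketch follows the formulas given above literally; the two documents and the helper names `tf`, `idf`, and `tf_idf` are assumptions for illustration, and libraries usually apply smoothed variants of these formulas.

```python
import math
from collections import Counter

docs = [
    "the fox jumps over the lazy dog",
    "dog and fox are lazy",
]
tokenized = [doc.split() for doc in docs]
vocab = sorted({word for tokens in tokenized for word in tokens})

# Bag of words: one count per vocabulary word; word order is discarded.
bow = [[Counter(tokens)[word] for word in vocab] for tokens in tokenized]


def tf(word, tokens):
    """Occurrences of word in the document / total words in the document."""
    return tokens.count(word) / len(tokens)


def idf(word):
    """log(total documents / documents containing the word)."""
    containing = sum(1 for tokens in tokenized if word in tokens)
    return math.log(len(tokenized) / containing)


def tf_idf(word, tokens):
    return tf(word, tokens) * idf(word)


print(vocab)
print(bow)
print(round(tf_idf("the", tokenized[0]), 4))  # "the" occurs only in the first document
print(tf_idf("fox", tokenized[0]))            # 0.0, because "fox" occurs in every document
```

A word that appears in every document gets an IDF of zero, which is how TF-IDF discounts words that do not distinguish one document from another.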
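## NLTK (Natural Language Toolkit)
- A Python natural language processing library that provides a range of modules and functions for processing and analyzing text data.
- NLTK includes many corpora and pre-trained models, which makes it convenient for demonstrating and developing NLP tasks.
- It is widely used in academic research and teaching.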
### Installing NLTK

![image.png](attachment:image.png)



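- corpora: this folder contains NLTK's code for working with corpora
- tokenize: this folder contains NLTK's word and sentence tokenization code
- tag: this folder contains NLTK's POS tagging and named entity recognition code
- stemmer: this folder usually contains the modules and data for stemming and lemmatization, for example PorterStemmer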


In [None]:
%pip install nltk

import nltk
nltk.download()  # opens the interactive NLTK data downloader

https://github.com/nltk/nltk 


### Corpus

In [84]:
# Load a corpus

from nltk.tokenize import sent_tokenize
from nltk.corpus import state_union

# sample text
sample = state_union.raw("2005-GWBush.txt")

tok = sent_tokenize(sample)

# Print the first 100 sentences (fewer if the text is shorter)
for sent in tok[:100]:
    print(sent)

PRESIDENT GEORGE W. BUSH'S ADDRESS BEFORE A JOINT SESSION OF THE CONGRESS ON THE STATE OF THE UNION
 
February 2, 2005


9:10 P.M. EST 

THE PRESIDENT: Mr. Speaker, Vice President Cheney, members of Congress, fellow citizens: 

As a new Congress gathers, all of us in the elected branches of government share a great privilege: We've been placed in office by the votes of the people we serve.
And tonight that is a privilege we share with newly-elected leaders of Afghanistan, the Palestinian Territories, Ukraine, and a free and sovereign Iraq.
(Applause.)
Two weeks ago, I stood on the steps of this Capitol and renewed the commitment of our nation to the guiding ideal of liberty for all.
This evening I will set forth policies to advance that ideal at home and around the world.
Tonight, with a healthy, growing economy, with more Americans going back to work, with our nation an active force for good in the world -- the state of our union is confident and strong.
(Applause.)
Our generation has

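### Lexicon
- WordNet is a lexical database of English created at Princeton University. It works like a thesaurus and is included among the NLTK corpora.
- Together with NLTK, WordNet can be used to look up word meanings, synonyms, and antonyms.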

In [85]:
from nltk.corpus import wordnet
syns = wordnet.synsets("good")  # all synsets (senses) of "good"
print(syns)
print(syns[3].definition())  # definition of the fourth sense
print(syns[3].examples())    # usage examples for that sense (may be empty)
print('*' * 44)              # separator line
synonyms = []
antonyms = []
for syn in syns:
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
        if lemma.antonyms():
            antonyms.append(lemma.antonyms()[0].name())
print(synonyms)
print(antonyms)



[Synset('good.n.01'), Synset('good.n.02'), Synset('good.n.03'), Synset('commodity.n.01'), Synset('good.a.01'), Synset('full.s.06'), Synset('good.a.03'), Synset('estimable.s.02'), Synset('beneficial.s.01'), Synset('good.s.06'), Synset('good.s.07'), Synset('adept.s.01'), Synset('good.s.09'), Synset('dear.s.02'), Synset('dependable.s.04'), Synset('good.s.12'), Synset('good.s.13'), Synset('effective.s.04'), Synset('good.s.15'), Synset('good.s.16'), Synset('good.s.17'), Synset('good.s.18'), Synset('good.s.19'), Synset('good.s.20'), Synset('good.s.21'), Synset('well.r.01'), Synset('thoroughly.r.02')]
articles of commerce
[]

### Tokens

In [None]:
from nltk.tokenize import sent_tokenize, word_tokenize


comment = "The actor's act was impressive. i like his show!!. "

sent_token = sent_tokenize(comment)  # split the text into sentences, based on punctuation
print(sent_token)
print('***********')
word_token = word_tokenize(comment)  # split the text into word tokens

print(word_token)

def print_seperator():
    print('********************************************')


### Stop words

In [130]:
# Stop words
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

stop_words = set(stopwords.words('english'))  # all English stop words shipped with NLTK
#print(stop_words) 


def filtered_stop_words(text):
    """Tokenize text and drop stop words (case-sensitive: 'This' is kept, 'this' is not)."""
    tokens = word_tokenize(text)
    return [w for w in tokens if w not in stop_words]


text = 'This is a sample sentence, showing off the stop words filtration. '

filtered_words = filtered_stop_words(text)  # tokens left after stop-word removal
print(filtered_words)

# Add custom stop words
manual_stop_words = {'This'}
manual_stop_words = manual_stop_words.union(stop_words)
filtered_words2 = [w for w in filtered_words if w not in manual_stop_words]  # tokens left after the custom filter

print(filtered_words2)





['This', 'sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']
['sample', 'sentence', ',', 'showing', 'stop', 'words', 'filtration', '.']


### Stemming and lemmatization

In [92]:
# Stemming
# Stemming is a form of normalization: apart from tense, many variants of a word share the same meaning.
from nltk.stem import PorterStemmer, WordNetLemmatizer
text = 'I am a pythoner, my daily work is to write python scripts, when i was pythoning, i can feel it is pythonly'
filtered_words = filtered_stop_words(text)  # remove stop words first
print(filtered_words)
ps = PorterStemmer()

# Stemming
for w in filtered_words:
    print(ps.stem(w))


# Lemmatization
lemm = WordNetLemmatizer()
for w in filtered_words:
    print(lemm.lemmatize(w))


# The pos argument: when a word can have several parts of speech (n, a, v), pos selects which one to use
str1 = 'xiaoming is a good programer, his python skill is better than java skill,his defect is better than others'
filtered_words = filtered_stop_words(str1)

count = 0
result = ''
for w in filtered_words:
    if w == 'better':
        if count == 0:
            w = lemm.lemmatize(w, pos='a')  # first "better": lemmatized as an adjective --> good
            count += 1
        else:
            w = lemm.lemmatize(w, pos='n')  # second "better": lemmatized as a noun --> better
    else:
        w = lemm.lemmatize(w)  # default pos='n'; the result is now kept instead of discarded
    result += w + " "
print(result)







['I', 'pythoner', ',', 'daily', 'work', 'write', 'python', 'scripts', ',', 'pythoning', ',', 'feel', 'pythonly']
i
python
,
daili
work
write
python
script
,
python
,
feel
pythonli
I
pythoner
,
daily
work
write
python
script
,
pythoning
,
feel
pythonly
xiaoming good programer , python skill good java skill , defect better others 


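### Part-of-speech (POS) tagging
- NLTK can do part-of-speech tagging for you: it labels each word in a sentence as a noun, adjective, verb, and so on. The tags also encode tense and other distinctions.
- POS tag list:

- CC    coordinating conjunction
- CD    cardinal number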
- DT    determiner
- EX    existential there (like: "there is" ... think of it like "there exists")
- FW    foreign word
- IN    preposition/subordinating conjunction
- JJ    adjective    'big'
- JJR    adjective, comparative    'bigger'
- JJS    adjective, superlative    'biggest'
- LS    list marker    1)
- MD    modal    could, will
- NN    noun, singular    'desk'
- NNS    noun, plural    'desks'
- NNP    proper noun, singular    'Harrison' (names of people, places, organizations, and other proper nouns)
- NNPS    proper noun, plural    'Americans'
- PDT    predeterminer    'all the kids'
- POS    possessive ending    parent's
- PRP    personal pronoun    I, he, she
- PRP$    possessive pronoun    my, his, hers
- RB    adverb    very, silently,
- RBR    adverb, comparative    better
- RBS    adverb, superlative    best
- RP    particle    give up
- TO    to    go 'to' the store.
- UH    interjection    errrrrrrrm
- VB    verb, base form    take
- VBD    verb, past tense    took
- VBG    verb, gerund/present participle    taking
- VBN    verb, past participle    taken
- VBP    verb, non-3rd person singular present    take
- VBZ    verb, 3rd person sing. present    takes
- WDT    wh-determiner    which
- WP    wh-pronoun    who, what
- WP$    possessive wh-pronoun    whose
- WRB    wh-adverb    where, when

In [93]:
# PunktSentenceTokenizer is a sentence tokenizer that can be trained on any text you supply
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import state_union  # corpus of State of the Union addresses
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")
custom_sent_tokenizer = PunktSentenceTokenizer(train_text)  # train on the 2005 speech
tokenized = custom_sent_tokenizer.tokenize(sample_text)     # split the 2006 speech into sentences


def process_content():
    try:
        for i in tokenized[:1]:  # tag only the first sentence
            words = word_tokenize(i)
            tagged = nltk.pos_tag(words)  # assign a POS tag to each token
            print(tagged)

    except Exception as e:
        print(e)  # avoids str(), which earlier cells rebound to a string


process_content()

# [('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'NNP'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'NNP'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'NNP'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'DT'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')] [('Mr.', 'NNP'), ('Speaker', 'NNP'), (',', ','), ('Vice', 'NNP'), ('President', 'NNP'), ('Cheney', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('Congress', 'NNP'), (',', ','), ('members', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('Supreme', 'NNP'), ('Court', 'NNP'), ('and', 'CC'), ('diplomatic', 'JJ'), ('corps', 'NNS'), (',', ','), ('distinguished', 'VBD'), ('guests', 'NNS'), (',', ','), ('and', 'CC'), ('fellow', 'JJ'), ('citizens', 'NNS'), (':', ':'), ('Today', 'NN'), ('our', 'PRP$'), ('nation', 'NN'), ('lost', 'VBD'), ('a', 'DT'), ('beloved', 'VBN'), (',', ','), ('graceful', 'JJ'), (',', ','), ('courageous', 'JJ'), ('woman', 'NN'), ('who', 'WP'), ('called', 'VBN'), ('America', 'NNP'), ('to', 'TO'), ('its', 'PRP$'), ('founding', 'NN'), ('ideals', 'NNS'), ('and', 'CC'), ('carried', 'VBD'), ('on', 'IN'), ('a', 'DT'), ('noble', 'JJ'), ('dream', 'NN'), ('.', '.')] [('Tonight', 'NNP'), ('we', 'PRP'), ('are', 'VBP'), ('comforted', 'VBN'), ('by', 'IN'), ('the', 'DT'), ('hope', 'NN'), ('of', 'IN'), ('a', 'DT'), ('glad', 'NN'), ('reunion', 'NN'), ('with', 'IN'), ('the', 'DT'), ('husband', 'NN'), ('who', 'WP'), ('was', 'VBD'), ('taken', 'VBN'), ('so', 'RB'), ('long', 'RB'), ('ago', 'RB'), (',', ','), ('and', 'CC'), ('we', 'PRP'), ('are', 'VBP'), ('grateful', 'JJ'), ('for', 'IN'), ('the', 'DT'), ('good', 'NN'), ('life', 'NN'), ('of', 'IN'), ('Coretta', 'NNP'), ('Scott', 'NNP'), ('King', 'NNP'), ('.', '.')] [('(', 'NN'), ('Applause', 'NNP'), ('.', '.'), (')', ':')] 
[('President', 'NNP'), ('George', 'NNP'), ('W.', 'NNP'), ('Bush', 'NNP'), ('reacts', 'VBZ'), ('to', 'TO'), ('applause', 'VB'), ('during', 'IN'), ('his', 'PRP$'), ('State', 'NNP'), ('of', 'IN'), ('the', 'DT'), ('Union', 'NNP'), ('Address', 'NNP'), ('at', 'IN'), ('the', 'DT'), ('Capitol', 'NNP'), (',', ','), ('Tuesday', 'NNP'), (',', ','), ('Jan', 'NNP'), ('.', '.')]



[('PRESIDENT', 'NNP'), ('GEORGE', 'NNP'), ('W.', 'NNP'), ('BUSH', 'NNP'), ("'S", 'POS'), ('ADDRESS', 'NNP'), ('BEFORE', 'IN'), ('A', 'NNP'), ('JOINT', 'NNP'), ('SESSION', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('CONGRESS', 'NNP'), ('ON', 'NNP'), ('THE', 'NNP'), ('STATE', 'NNP'), ('OF', 'IN'), ('THE', 'NNP'), ('UNION', 'NNP'), ('January', 'NNP'), ('31', 'CD'), (',', ','), ('2006', 'CD'), ('THE', 'NNP'), ('PRESIDENT', 'NNP'), (':', ':'), ('Thank', 'NNP'), ('you', 'PRP'), ('all', 'DT'), ('.', '.')]


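#### Chunking
    - One of the main goals of chunking is to group words into "noun phrases": one or more words built around a noun, possibly with some descriptive words, a verb, or an adverb.
    - Chunks are defined with regular expressions over POS tags:
    +  one or more
    ?  zero or one
    *  zero or more
    .  any character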
In [131]:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize("Greg went to  Southeast University and XuanWU lake last week")


def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)  # assign a POS tag to each token
            # Chunk grammar, read left to right:
            #   <NNP.?>*  zero or more proper nouns (NNP or NNPS), followed by
            #   <VBD.?>*  zero or more past-tense verbs, followed by
            #   <NNP>+    one or more singular proper nouns, followed by
            #   <NN>?     zero or one singular common noun
            chunk_gram = r"""Chunk: {<NNP.?>*<VBD.?>*<NNP>+<NN>?}"""
            chunk_parser = nltk.RegexpParser(chunk_gram)
            chunked = chunk_parser.parse(tagged)
            chunked.draw()  # opens a window showing the parse tree (needs a GUI)
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)
    except Exception as e:
        print(e)



process_content()

(Chunk Greg/NNP)
(Chunk Southeast/NNP University/NNP)
(Chunk XuanWU/NNP)


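#### Chinking
    - Chinking is much like chunking: it is a way to remove something from a chunk. The part you remove from the chunk is the chink.
    - A chink is written with inverted braces, }{, around the tag pattern to remove.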

In [170]:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize("Greg went to  Southeast University and XuanWU lake last week")


def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)  # assign a POS tag to each token
            # Step 1: {<.*>+} puts every token into one chunk.
            # Step 2: }<VB.?|IN|DT|TO>+{ chinks (removes) each run of verbs,
            #         prepositions, determiners, or the word "to" from that chunk.
            #chunk_gram = r"""Chunk: {<RB.?>*<VB.?>*<NNP>+<NN>?}"""
            chunk_gram = r"""Chunk: {<.*>+}
                                    }<VB.?|IN|DT|TO>+{"""
            chunk_parser = nltk.RegexpParser(chunk_gram)
            chunked = chunk_parser.parse(tagged)
            #chunked.draw()     
            for subtree in chunked.subtrees(filter=lambda t: t.label() == 'Chunk'):
                print(subtree)
    except Exception as e:
        print(e)



process_content()

(Chunk Greg/NNP)
(Chunk Southeast/NNP University/NNP and/CC XuanWU/NNP)
(Chunk last/JJ week/NN)


#### Named entity recognition
    - Lets the machine pull out "entities" directly, such as people, places, things, locations, and monetary amounts.

In [132]:
import nltk
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer

train_text = state_union.raw("2005-GWBush.txt")
sample_text = state_union.raw("2006-GWBush.txt")

custom_sent_tokenizer = PunktSentenceTokenizer(train_text)

tokenized = custom_sent_tokenizer.tokenize("Greg went to Peking University and XuanWu Lake last week ")

def process_content():
    try:
        for i in tokenized:
            words = nltk.word_tokenize(i)
            tagged = nltk.pos_tag(words)
            namedEnt = nltk.ne_chunk(tagged, binary=False)
            # namedEnt.draw()
            print(namedEnt)
    except Exception as e:
        print(e)


process_content()

(S
  (GPE Greg/NNP)
  went/VBD
  to/TO
  (ORGANIZATION Peking/NNP University/NNP)
  and/CC
  (ORGANIZATION XuanWu/NNP Lake/NNP)
  last/JJ
  week/NN)


#### BOW

In [103]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from nltk import sent_tokenize
#vectorizer = CountVectorizer()  # plain bag-of-words counts
vectorizer = TfidfVectorizer()   # TF-IDF weights (the output below comes from this vectorizer)

paragraph = """The fox jumps over the lazy dog. Dog and fox are lazy"""
corpus = sent_tokenize(paragraph)  # each sentence is one document
X = vectorizer.fit_transform(corpus) 

feature_names = vectorizer.get_feature_names_out() 

X_array = X.toarray() 

print("Vocabulary: \n", feature_names)
print("Feature matrix:\n", X_array)

Vocabulary: 
 ['and' 'are' 'dog' 'fox' 'jumps' 'lazy' 'over' 'the']
Feature matrix:
 [[0.         0.         0.25948224 0.25948224 0.36469323 0.25948224
  0.36469323 0.72938646]
 [0.53309782 0.53309782 0.37930349 0.37930349 0.         0.37930349
  0.         0.        ]]
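
The numbers in the matrix above can be reproduced by hand. The sketch below assumes the default `TfidfVectorizer` settings: raw term counts, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The helper names `smooth_idf` and `tfidf_row` are hypothetical.

```python
import math

# The two sentences from the cell above, lowercased and tokenized by hand.
docs = [["the", "fox", "jumps", "over", "the", "lazy", "dog"],
        ["dog", "and", "fox", "are", "lazy"]]
vocab = sorted({w for d in docs for w in d})
n_docs = len(docs)


def smooth_idf(word):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    df = sum(1 for d in docs if word in d)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(doc):
    raw = [doc.count(w) * smooth_idf(w) for w in vocab]  # raw count times IDF
    norm = math.sqrt(sum(v * v for v in raw))            # L2 norm of the row
    return [v / norm for v in raw]


for doc in docs:
    print([round(v, 8) for v in tfidf_row(doc)])
```

Because of the smoothing and normalization, these values differ from the plain TF * IDF formula given earlier, even though the idea is the same.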


#### TF-IDF model

In [12]:
import heapq
import re
from collections import Counter

import numpy as np
from nltk.corpus import state_union, stopwords
from nltk.tokenize import sent_tokenize, word_tokenize

# Stop words, plus a custom one for the "(Applause.)" markers in the transcript
stop_words = set(stopwords.words('english'))
manual_stop_words = {'applause'}.union(stop_words)

speech = state_union.raw("2005-GWBush.txt")
sentence = sent_tokenize(speech)

for i in range(len(sentence)):
    sentence[i] = sentence[i].lower()               # lowercase
    sentence[i] = re.sub(r'\W', ' ', sentence[i])   # replace non-word characters with spaces
    sentence[i] = re.sub(r'\s+', ' ', sentence[i])  # collapse runs of whitespace

# Tokenize each sentence once; each sentence is treated as one document
doc_tokens = [word_tokenize(sent) for sent in sentence]

# Count word occurrences, ignoring stop words
word_count = Counter()
for tokens in doc_tokens:
    word_count.update(w for w in tokens if w not in manual_stop_words)

# The 100 most frequent words
freq_words = heapq.nlargest(100, word_count, key=word_count.get)
#for fw in freq_words:
#    print(fw, word_count[fw])

print('*' * 44)
# TF-IDF
# TF  = occurrences of the word in the document / total words in the document
# IDF = log(total number of docs / number of docs that contain the word)

# Compute IDF
word_idfs = {}
for fw in freq_words:
    doc_count = sum(1 for tokens in doc_tokens if fw in tokens)  # documents containing the word
    # This cell uses log(N / df + 1); the extra 1 inside the logarithm keeps every value positive
    word_idfs[fw] = np.log(len(sentence) / doc_count + 1)

print(word_idfs)

# Compute TF: one value per (word, document) pair
word_tfs = {}
for fw in freq_words:
    doc_tf = []
    for tokens in doc_tokens:
        frequency = tokens.count(fw)
        doc_tf.append(frequency / len(tokens) if tokens else 0.0)
    word_tfs[fw] = doc_tf

print(word_tfs)

### word2vec

In [1]:
import heapq
import numpy as np
import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import state_union
from nltk.tokenize import PunktSentenceTokenizer
from  nltk.corpus import stopwords

 
speech = '''
The Benefits of Reading.Reading is a vital skill that offers numerous benefits. It expands our knowledge and understanding of the world. Through books, we can explore different cultures, gain insights into historical events, and learn about various subjects. Reading also enhances our vocabulary and language skills, improving our communication abilities.

In addition to these cognitive benefits, reading has a profound impact on our emotional well-being. It allows us to escape into different worlds, experiencing a wide range of emotions through the characters and stories we encounter. This escapism can be a valuable form of relaxation and stress relief.

Moreover, reading fosters creativity and imagination. As we immerse ourselves in the narratives, our minds create vivid images and scenarios, stimulating our creative thinking. This imaginative exercise can have a positive influence on our personal and professional lives, encouraging innovation and problem-solving.

In conclusion, reading is not just a hobby but an essential activity that enriches our lives in countless ways. It is a tool for personal growth, emotional regulation, and intellectual development. Therefore, we should make it a priority to read regularly and encourage others to do the same.
'''
stop_words = set(stopwords.words('english'))
sentences = sent_tokenize(speech)

# One token list per sentence: lowercase, strip punctuation, drop stop words
words = []
for sent in sentences:
    sent = re.sub(r'\W+', ' ', sent.lower())
    words.append([w for w in word_tokenize(sent) if w not in stop_words])

from gensim.models import Word2Vec

model = Word2Vec(words, min_count=1, vector_size=300, window=10)

# Look up a word that survives stop-word removal
# ('but' is an NLTK stop word, so it is not in the vocabulary)
vector = model.wv['reading']
print(vector)
#simi = model.wv.most_similar('reading')
#print(simi)
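
Once words are vectors, similarity and analogy become arithmetic. The sketch below uses made-up 3-dimensional vectors, which are an assumption for illustration only: trained embeddings have hundreds of dimensions and are learned from data, so the clean result here is by construction.

```python
import math

# Made-up toy vectors (not trained embeddings).
vectors = {
    "king":  [0.8, 0.9, 0.1],
    "queen": [0.8, 0.1, 0.9],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.1, 0.5, 0.5],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(target, exclude=()):
    """Return the known word whose vector is most similar to target."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# king - man + woman, computed component by component
analogy = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]
print(nearest(analogy, exclude={"king", "man", "woman"}))
print(round(cosine(vectors["king"], vectors["queen"]), 3))
```

With a trained gensim model, the corresponding operations are available on `model.wv`, for example through its `most_similar` method.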

### Sentiment analysis

In [129]:
import json
import os

from nltk.corpus import stopwords  # stop words
from nltk.tokenize import word_tokenize
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
data_dir = '/users/zhuyang/nltk_data/corpora/state_union'
file_name = 'greg.json'


file_path = os.path.join(data_dir, file_name)

with open(file_path, 'r') as file:
    data = json.load(file)
    

reviews=[]
labels=[]

for item in data:
    review = item['review']
    label = item['rating']
    reviews.append(review)
    labels.append(label)


# Preprocess the data
stop_words = set(stopwords.words('english'))
# Remove stop words
def filter_stop_words(review):
    words = word_tokenize(review)
    filtered_tokens = [token for token in words if token not in stop_words]
    return ' '.join(filtered_tokens)

filtered_reviews = [filter_stop_words(review ) for review in reviews]
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer 

# Bag-of-words model
#vector = CountVectorizer()
# Feature extraction based on TF-IDF
vector = TfidfVectorizer()


X = vector.fit_transform(filtered_reviews)
y = labels


# Split the data into training and test sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.22, random_state=42)  

# Train
from sklearn.naive_bayes import MultinomialNB  # multinomial naive Bayes
from sklearn.tree import DecisionTreeClassifier  # decision tree
from sklearn.ensemble import RandomForestClassifier  # random forest
from sklearn.svm import SVC  # support vector classifier
classifier = MultinomialNB() #0.72
#classifier = DecisionTreeClassifier()# 0.64
#classifier = RandomForestClassifier(n_estimators=20, random_state=42) #0.74
#classifier = SVC(kernel='linear', C=1.0, random_state=42) #0.76
classifier.fit(X_train,y_train)

# Predict
y_pred = classifier.predict(X_test)

# Print the accuracy
print(accuracy_score(y_test,y_pred))


# Collect the misclassified samples
incorrect_predictions = [(y_t, y_p) for y_t, y_p in zip(y_test, y_pred) if y_t != y_p]

correct_predictions = [(y_t, y_p) for y_t, y_p in zip(y_test, y_pred) if y_t == y_p]

# Print the number of misclassified samples
print(f'Incorrect predictions: {len(incorrect_predictions)}')
print(f'Total predictions: {len(y_pred)}')
 
'''    
for y_true, y_pred in correct_predictions:
    #print(f'True: {y_true}, Predicted: {y_pred}')
    print('-' * 50)
'''    


def predict_comment(comment):
    """Predict the rating of a single review with the trained classifier."""
    comment = filter_stop_words(comment)
    feature = vector.transform([comment])
    prediction = classifier.predict(feature)
    return prediction


print(predict_comment("suggest to add a extra function into the APP which allow me to check UP.TO.DATE TOTAL AMOUNT SMRT REBATES REDEEMED & THE AVAILABLE AMOUNT OF BALANCED SMRT REBATES LEFT FOR THE ANNUAL CAP OF $600 FROM MY SMRT VISA"))

0.7283813747228381
Incorrect predictions: 245
Total predictions: 902
[1]

### Sentiment analysis with NLTK

In [128]:
import json
import os

from nltk.corpus import stopwords  # stop words
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
data_dir = '/users/zhuyang/nltk_data/corpora/state_union'
file_name = 'greg.json'


file_path = os.path.join(data_dir, file_name)

with open(file_path, 'r') as file:
    data = json.load(file)
    

reviews=[]
labels=[]

for item in data:
    review = item['review']
    label = item['rating']
    reviews.append(review)
    labels.append(label)


# Preprocess the data
stop_words = set(stopwords.words('english'))
# Remove stop words
def filter_stop_words(review):
    words = word_tokenize(review)
    filtered_tokens = [token for token in words if token not in stop_words]
    return ' '.join(filtered_tokens)

filtered_reviews = [filter_stop_words(review ) for review in reviews]

sia = SentimentIntensityAnalyzer()
for i in range(len(filtered_reviews)):
    review = filtered_reviews[i]
    score = sia.polarity_scores(review)
    rate = labels[i]
    print(score,review,'label=',rate)
    
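'''
pos is the proportion of the text rated as positive.
neg is the proportion of the text rated as negative.
neu is the proportion of the text rated as neutral.
compound is an overall score for the sentiment of the whole text, a float between -1.0 and 1.0.
It is obtained by summing the valence scores of the words and normalizing the sum:

compound = x / sqrt(x^2 + alpha), where x is the summed valence and alpha is 15 by default

A score close to 1.0 means the text is very positive.
A score close to 0.0 means the text is neutral, with no clear positive or negative sentiment.
A score close to -1.0 means the text is very negative.

'''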

{'neg': 0.0, 'neu': 0.854, 'pos': 0.146, 'compound': 0.7212} Downloaded app activated digital card Citi PremierMiles . The temporary cvv popped I didnt take notice . Subsequent attempts view temporary cvv led pop activate physical card instead . So see card number , expiry date cvv ! ! ! ! Messaged customer center said needed escalate relevant teams solve . Seems like bug . label= 1
{'neg': 0.211, 'neu': 0.617, 'pos': 0.172, 'compound': -0.7368} -1 Stars . * * Stop telling user send email support . U Tell Your Support Team The Issue ! ! * * .. Will give negative star option . Now even phone keyboard also want monitor . Why Google Keyboard also allow ! ! ? ? Useless lousy banking apps SG ! Slow loading user friendly ! .. Not able use Developer Option enable ! Sometimes need enable Developer Option certain function phone . Citibank So Disappointed In You ! 😡 label= 1
{'neg': 0.042, 'neu': 0.88, 'pos': 0.078, 'compound': 0.2598} The Citibank IT set-up deplorable . Functions app n't work .

{'neg': 0.187, 'neu': 0.813, 'pos': 0.0, 'compound': -0.7178} Suddenly able login recent update keeps saying insecured connection Pixel 4A . I latest software update turned VPN try login still failed . Uninstalled moved Citi mobile app iPad . Really inconvenient . label= 1
{'neg': 0.135, 'neu': 0.748, 'pos': 0.117, 'compound': 0.1561} app n't seems allow quit properly . click back phone , n't prompt want quit like bank app . option download e-statement label= 2
{'neg': 0.197, 'neu': 0.727, 'pos': 0.075, 'compound': -0.6908} Slow going login screen .. Now I try update downloads failed update . So I uninstall try reinstall . Same issue downloads fails update . Cleared google play store cache data restart samsung galaxy s9 But problem still label= 2
{'neg': 0.0, 'neu': 0.542, 'pos': 0.458, 'compound': 0.9136} Brilliantly simple fast app . However lacks ability show full loyalty points gives total . points . Here Amex app much better . label= 4
{'neg': 0.0, 'neu': 0.917, 'pos': 0.083, 'com

{'neg': 0.276, 'neu': 0.724, 'pos': 0.0, 'compound': -0.7574} Very bad , keep issue logging . Encountered error even though password login id correct . Only use internet browser access account . label= 1
{'neg': 0.215, 'neu': 0.785, 'pos': 0.0, 'compound': -0.7202} I log password app loops back login page . Waste time . They dont even tell outstanding balance tell make payment , yet still try sell products services . label= 1
{'neg': 0.155, 'neu': 0.775, 'pos': 0.07, 'compound': -0.3182} Hi I 'm back . Okay I think nubbad . Chatbox little laggy . Service representative quality hit miss I guess . Still annoyed I got timed replying cause Chatbox crashed tho . label= 3
{'neg': 0.2, 'neu': 0.667, 'pos': 0.133, 'compound': -0.0742} App n't work . Have trying cancel credit card automated self service goes round round without helping label= 1
{'neg': 0.0, 'neu': 0.734, 'pos': 0.266, 'compound': 0.7003} Kindly note sometimes u make payment credit card savings account . The payment amount deduc

{'neg': 0.101, 'neu': 0.899, 'pos': 0.0, 'compound': -0.4215} Fails get past loading page biometric authentication . Update : After emailing customer service per reply , I provided reply said customer service provided `` general inquiries products services . '' label= 1
{'neg': 0.162, 'neu': 0.703, 'pos': 0.136, 'compound': -0.4788} 1.App Pay NTUC income bill , keep prompt error . Ca n't pay . 2.App show amk yio chu Kang town council , ca n't find amk town council , cant pay bill 3.axs pay bill allow Citi master visa , visa useless label= 2
{'neg': 0.327, 'neu': 0.673, 'pos': 0.0, 'compound': -0.5267} Can even activate physical card . Fix stupid app label= 1
{'neg': 0.341, 'neu': 0.638, 'pos': 0.02, 'compound': -0.9276} Recently frustrated way cards managed ! Pissed . Citibank blocked card without informing . Only got know failed online transaction via merchandise . I personnally find RUDE Gesture citibank . My x2 cards affected . Will monitor situation . Not sure , serious Citibank cu

{'neg': 0.556, 'neu': 0.444, 'pos': 0.0, 'compound': -0.3612} Slow user unfriendly label= 1
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0} Keep sending OTP I received nothing via phone . label= 1
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0} It crashes everytime I login label= 1
{'neg': 0.0, 'neu': 0.394, 'pos': 0.606, 'compound': 0.4728} Very fast efficient label= 5
{'neg': 0.404, 'neu': 0.596, 'pos': 0.0, 'compound': -0.5256} App keep crashing ... So disappointed label= 1
{'neg': 0.0, 'neu': 0.125, 'pos': 0.875, 'compound': 0.7906} Quick & easy , great label= 5
{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0} The apps took long time load , label= 3
{'neg': 0.324, 'neu': 0.556, 'pos': 0.12, 'compound': -0.4939} Lousy app . Ca n't even find hotline number label= 1
{'neg': 0.423, 'neu': 0.577, 'pos': 0.0, 'compound': -0.296} No update , slow update label= 1
{'neg': 0.0, 'neu': 0.385, 'pos': 0.615, 'compound': 0.4939} user friendly fast label= 5
{'neg': 0.0, 'neu': 0.

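
The `compound` values in the output above always fall between -1 and 1 because the analyzer normalizes an unbounded sum of word valences. The sketch below shows that normalization step as I understand it from the VADER implementation; the function name, the default `alpha=15`, and the sample sums are stated here as assumptions, not as a copy of the library code.

```python
import math


def normalize(score, alpha=15):
    """Map an unbounded valence sum onto the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)


for valence_sum in (0.0, 1.0, 4.0, -4.0, 20.0):
    print(valence_sum, "->", round(normalize(valence_sum), 4))
```

Small sums stay close to zero, while large positive or negative sums approach 1 or -1 without ever reaching them.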