## Natural Language Processing Analysis of State Curricula in Brazil

### Step 01: Text Wrangling

Cleaning the data:

- Tokenizer
- Tagger
- Parser
- NER
- Remove punctuation

In [1]:
import spacy
import numpy as np
import pandas as pd
import seaborn as sns
import os
import glob

from string import punctuation
from collections import Counter

import regex as re

In [2]:
import nltk
nltk.download('words')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package words to
[nltk_data]     C:\Users\Francisco\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Francisco\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Francisco\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Francisco\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\Francisco\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!


In [3]:
# Setup in Brazilian Portuguese

import pt_core_news_sm
nlp = pt_core_news_sm.load()
path = r'C:\Users\Francisco\Desktop\Python\0. TCC ANALYSIS\Corpus\*.txt'
nlp.max_length = 5384682
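
spaCy rejects texts longer than `nlp.max_length` characters (1,000,000 by default), which is why the cell above raises the limit. An alternative that keeps memory use lower is to split the text into smaller pieces and process each one separately, for example with `nlp.pipe`. The helper below is a minimal sketch of that idea using only the standard library; `split_by_lines` and the sample text are hypothetical names and data, not part of the original notebook.

```python
def split_by_lines(text, max_chars):
    """Group consecutive lines into chunks of at most max_chars characters.

    A single line longer than max_chars is kept whole as its own chunk.
    """
    chunks, current, size = [], [], 0
    for line in text.splitlines(keepends=True):
        # Start a new chunk when adding this line would exceed the limit
        if current and size + len(line) > max_chars:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        chunks.append("".join(current))
    return chunks


# Made-up sample text; in the notebook this would be the curriculum text
sample = "linha um\nlinha dois\nlinha três\n"
chunks = split_by_lines(sample, max_chars=20)
print(chunks)
```

Splitting on line breaks keeps each line intact, but a sentence that spans two chunks would be cut, so the chunk size should stay generous.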

In [4]:
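# Read the .txt files of the corpus
# Note: `text` is overwritten on every iteration, so after the loop it holds
# only the last file read; that is the text processed in the next cells.
# (The per-line call to nlp() was dropped because its result was discarded.)

for file in glob.glob(path):
    with open(file, encoding='utf-8', errors='ignore') as file_in:
        text = file_in.read()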
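
The loop above leaves only the last file read in `text`. If the aim is to compare the curricula of several states, one option is to keep every file in a dictionary keyed by file name. The sketch below assumes that layout; `load_corpus` and the two file names are hypothetical, and the demonstration runs on a temporary folder rather than the real corpus path.

```python
import tempfile
from pathlib import Path


def load_corpus(folder):
    """Return {file name without extension: contents} for every .txt file in folder."""
    corpus = {}
    for file in sorted(Path(folder).glob("*.txt")):
        # utf-8-sig drops a leading byte-order mark (\ufeff) if the file has one
        corpus[file.stem] = file.read_text(encoding="utf-8-sig", errors="ignore")
    return corpus


# Demonstration on a throwaway folder with two made-up files
with tempfile.TemporaryDirectory() as folder:
    Path(folder, "estado_a.txt").write_text("Currículo A", encoding="utf-8")
    Path(folder, "estado_b.txt").write_text("\ufeffCurrículo B", encoding="utf-8")
    corpus = load_corpus(folder)

print(corpus)
```

Each value in `corpus` could then be passed to `nlp` on its own, which keeps the documents separate for later comparison.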

The next cell passes the text to the model. spaCy applies its entire NLP pipeline as soon as the text is provided and returns a processed `Doc`.

<img src="https://d33wubrfki0l68.cloudfront.net/3ad0582d97663a1272ffc4ccf09f1c5b335b17e9/7f49c/pipeline-fde48da9b43661abcdf62ab70a546d71.svg">

In [5]:
# Processed Doc

doc = nlp(text)

In [14]:
# Words Object

words = [token.text for token in doc]
words[:5]

['\ufeffCURRÍCUlS^', '\n', 'PAULISTA', 'II', 'II']

In [15]:
# Sentences Object

sentences = [sent.text for sent in doc.sents]
sentences[:5]

['\ufeffCURRÍCUlS^\n',
 'PAULISTA II II',
 '\n',
 'SECRETARIA DA EDUCAÇÃO DO ESTADO DE SÃO PAULO\n',
 'CURRÍCULO']

## Cleaning

In [22]:
# Remove punctuation and whitespace tokens
no_punct_or_space = [token for token in doc if not token.is_punct and not token.is_space]

# Keep alphabetic tokens only (drops numbers) and lowercase them
lower_alpha = [token.lower_ for token in no_punct_or_space if token.is_alpha]

# Remove custom stop words (if needed); `clean` is defined in the next cell

#    custom_stopwords = ["assyrian", "babylon"]
#    custom_clean = [token for token in clean if token not in custom_stopwords]
#    custom_clean
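
The commented-out lines refer to `clean`, which only exists after the next cell has run, so the filter belongs after that cell. A minimal sketch of the step follows. The stop words are hypothetical choices, and the token list is a short sample taken from the start of the cleaned output.

```python
# Hypothetical custom stop words; pick terms that suit your own corpus
custom_stopwords = {"ii", "paulo"}

# A few tokens from the start of the cleaned text, used here as sample input
tokens = ["paulista", "ii", "ii", "secretaria", "educação", "paulo", "currículo"]

# Keep every token that is not in the custom stop word set
custom_clean = [token for token in tokens if token not in custom_stopwords]
print(custom_clean)  # ['paulista', 'secretaria', 'educação', 'currículo']
```

A set is used for the stop words because membership checks on a set stay fast as the list of excluded terms grows.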

In [24]:
# Cleaned corpus: lowercase alphabetic tokens with stop words removed

clean = [token.lower_ for token in no_punct_or_space if token.is_alpha and not token.is_stop]
clean[:30]

['paulista',
 'ii',
 'ii',
 'secretaria',
 'educação',
 'paulo',
 'currículo',
 'paulista',
 'etapa',
 'ensino',
 'médio',
 'paulo',
 'governo',
 'governador',
 'joão',
 'dória',
 'secretário',
 'educação',
 'rossieli',
 'soares',
 'silva',
 'secretário',
 'executivo',
 'haroldo',
 'corrêa',
 'rocha',
 'chefe',
 'gabinete',
 'renilda',
 'peres']

## Counting

In [25]:
print("Number of tokens in document: ", len(doc))
print("Number of tokens in cleaned document: ", len(clean))
print("Number of unique tokens in cleaned document: ", len(set(clean)))

Number of tokens in document:  119957
Number of tokens in cleaned document:  52071
Number of unique tokens in cleaned document:  6418
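
The three counts above can be turned into two ratios that are easier to compare across documents: the share of tokens that survive cleaning, and the type-token ratio (distinct tokens divided by total cleaned tokens). The sketch below hard-codes the reported numbers; the variable names are hypothetical.

```python
n_tokens = 119957   # tokens in the processed doc
n_clean = 52071     # tokens left after cleaning
n_unique = 6418     # distinct tokens after cleaning

share_kept = n_clean / n_tokens        # fraction of tokens that survive cleaning
type_token_ratio = n_unique / n_clean  # distinct tokens per cleaned token

print(f"Share of tokens kept: {share_kept:.1%}")     # Share of tokens kept: 43.4%
print(f"Type-token ratio: {type_token_ratio:.3f}")   # Type-token ratio: 0.123
```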


In [29]:
from collections import Counter

cleaned_counter = Counter(clean)
cleaned_counter.most_common(10)

[('diferentes', 469),
 ('estudante', 435),
 ('ensino', 341),
 ('vida', 313),
 ('processos', 299),
 ('textos', 291),
 ('social', 270),
 ('conhecimentos', 270),
 ('sociais', 269),
 ('produção', 269)]
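
Raw counts depend on document length, so comparing this curriculum with another one is easier with a rate such as occurrences per 1,000 cleaned tokens. The sketch below applies that normalization to the three most frequent words reported above; `top_counts` and `per_thousand` are hypothetical names.

```python
from collections import Counter

# Top counts reported above, plus the size of the cleaned token list
top_counts = Counter({"diferentes": 469, "estudante": 435, "ensino": 341})
n_clean = 52071

# Occurrences per 1,000 cleaned tokens, in descending order of frequency
per_thousand = {word: 1000 * count / n_clean for word, count in top_counts.most_common()}
for word, rate in per_thousand.items():
    print(f"{word}: {rate:.2f}")
```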

In [30]:
print(clean)

['paulista', 'ii', 'ii', 'secretaria', 'educação', 'paulo', 'currículo', 'paulista', 'etapa', 'ensino', 'médio', 'paulo', 'governo', 'governador', 'joão', 'dória', 'secretário', 'educação', 'rossieli', 'soares', 'silva', 'secretário', 'executivo', 'haroldo', 'corrêa', 'rocha', 'chefe', 'gabinete', 'renilda', 'peres', 'lima', 'subsecretário', 'articulação', 'regional', 'henrique', 'cunha', 'pimentel', 'filho', 'coordenador', 'pedagógico', 'caetano', 'pansani', 'siqueira', 'coordenadora', 'escola', 'formação', 'aperfeiçoamento', 'profissionais', 'educação', 'cristina', 'cassia', 'mabelini', 'silva', 'coordenadora', 'gestão', 'recursos', 'humanos', 'cristty', 'anny', 'hayon', 'coordenador', 'informação', 'tecnologia', 'evidência', 'matrícula', 'thiago', 'guimarães', 'cardoso', 'coordenador', 'infraestrutura', 'serviços', 'escolares', 'daniel', 'medeiros', 'coordenador', 'orçamento', 'finanças', 'william', 'bezerra', 'melo', 'uníao', 'dirigentes', 'municipais', 'educação', 'presidente', 'u