# Portuguese lemmatizer with spaCy

This Jupyter notebook takes a text file, or folder of text files in Portuguese, and creates a set of lemmatized derivative files (where all words are in their dictionary form, and not inflected). These lemmatized files can then be used for searching, or other computational text analysis methods.

For this notebook, we'll use the natural language processing (NLP) library spaCy, which supports many languages.

## First-time setup: spaCy library
The code cell below installs the *spacy* package, which does the actual lemmatizing. You only need to run it the first time you use this notebook in a particular environment (laptop, virtual machine, etc.). You can skip it the next time you use the notebook, but nothing bad will happen if you re-run it.

In [1]:
#Imports the module you need to download and install the spaCy modules
import sys
#Installs spaCy
!{sys.executable} -m pip install spacy
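
If you are not sure whether the setup cells have already been run in this environment, a quick check like the sketch below can tell you. It uses only the standard library. The helper name `is_installed` is made up for this example, and the list of module names is an assumption based on the imports used later in this notebook.

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can find a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None

# Module names used later in this notebook (assumed list; adjust as needed)
for name in ["spacy", "pt_core_news_sm", "unidecode"]:
    status = "installed" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

Anything reported as missing still needs its setup cell to be run.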



## First-time setup: download language data
The cell below downloads the Portuguese data for the spaCy NLP library. 

You can also check out [other language data available for spaCy](https://spacy.io/models), but you'll need to make a few other modifications to the code later on in order to use it. If you're new to Python, you might want to try one of the [other lemmatizer notebooks](https://github.com/quinnanya/dlcl204/tree/master/notebooks).

In [2]:
import spacy
!{sys.executable} -m spacy download pt_core_news_sm

Collecting pt-core-news-sm==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/pt_core_news_sm-3.2.0/pt_core_news_sm-3.2.0-py3-none-any.whl (22.2 MB)




[+] Download and installation successful
You can now load the package via spacy.load('pt_core_news_sm')


## Importing modules
The next code cells import the modules you need to run this notebook. Run them every time you open this notebook.

In [3]:
#os is used to navigate your folder directories (e.g. change folders to where your files are stored)
import os

#imports spaCy
import spacy

#imports the Portuguese model
import pt_core_news_sm

#loads the Portuguese pipeline (with some components switched off) so you can run it on texts
ptnlp = pt_core_news_sm.load(disable = ["ner", "tagger"])

In [4]:
ptnlp.max_length = 2557645 # or even higher
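
spaCy refuses texts longer than `max_length` characters (1,000,000 by default), which is why the cell above raises the limit. Raising it costs memory, so for very long files an alternative is to split the text into smaller pieces and process them one at a time. The sketch below is one possible way to do that with the standard library. `split_into_chunks` and its `limit` parameter are hypothetical names, and splitting on blank lines assumes your text uses them between paragraphs.

```python
def split_into_chunks(text, limit=100_000):
    """Group paragraphs into chunks of at most `limit` characters.

    A single paragraph longer than `limit` is kept whole as its own chunk.
    """
    chunks = []
    current = ""
    for paragraph in text.split("\n\n"):
        # Try adding this paragraph to the chunk being built
        candidate = paragraph if not current else current + "\n\n" + paragraph
        if current and len(candidate) > limit:
            # Too long: close the current chunk and start a new one
            chunks.append(current)
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = "primeiro parágrafo\n\nsegundo parágrafo\n\nterceiro parágrafo"
for chunk in split_into_chunks(sample, limit=40):
    print(repr(chunk))
```

Each chunk could then be passed to `ptnlp` in turn, with the lemmas written to the same output file.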

The code cell below imports modules that are useful for cleaning texts, particularly texts converted from ebooks that include typographically fancy single and double quote characters that can break the lemmatization. It also sets up a function, `clean_string`, that we'll use on the text before running the lemmatizer.

If your text doesn't have any of these problematic characters, nothing will happen (and that's okay).

In [5]:
#Imports a module for converting Unicode characters to ASCII (English alphabet & punctuation)
import unidecode
#Imports a module with data about all the Unicode characters
import unicodedata
#Imports a module that does regular expressions, a kind of fancy find-and-replace syntax
import re

#Defines a character filter function
def char_filter(string):
    #Defines the set of Latin characters
    latin = re.compile('[a-zA-Z]+')
    #For each character in the text...
    for char in unicodedata.normalize('NFC', string):
        #Convert it to its ASCII equivalent
        decoded = unidecode.unidecode(char)
        #If the ASCII equivalent is a letter
        if latin.match(decoded):
            #Keep the original character (so you don't lose accented letters)
            yield char
        #Otherwise...
        else:
            #Use the ASCII equivalent (e.g. standard quote character)
            yield decoded

#Defines a function for cleaning a text
def clean_string(string):
    #Runs the character filter function and reassembles the text
    return "".join(char_filter(string))
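
To see what the cleaning step is meant to do, here is a small standard-library approximation that you can run without installing `unidecode`. It is not the notebook's `clean_string`: it only handles the handful of typographic characters listed in the hypothetical `FANCY_PUNCTUATION` table, whereas `unidecode` covers far more. Accented Portuguese letters pass through unchanged in both approaches.

```python
# Hypothetical lookup table: a few typographic characters and ASCII stand-ins
FANCY_PUNCTUATION = str.maketrans({
    "\u2018": "'",    # left single quotation mark
    "\u2019": "'",    # right single quotation mark
    "\u201c": '"',    # left double quotation mark
    "\u201d": '"',    # right double quotation mark
    "\u2014": "--",   # em dash
    "\u2026": "...",  # horizontal ellipsis
})

def simple_clean(text):
    """Replace the characters in FANCY_PUNCTUATION; leave everything else alone."""
    return text.translate(FANCY_PUNCTUATION)

# Curly quotes and an ellipsis around ordinary Portuguese text
sample = "\u201cN\u00e3o sei\u2026\u201d disse a av\u00f3."
print(simple_clean(sample))
```

The quotes and ellipsis come out as plain ASCII, while "Não" and "avó" keep their accents.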

## Lemmatizing a single file
Put the full path to your text file in the cell below, using the correct syntax for your operating system. 

For instance, the default path to a text file in the Documents directory is (substituting your user name on the computer for YOUR-USER-NAME):

* On Mac: '/Users/YOUR-USER-NAME/Documents/YOUR-TEXT-FILE.txt'
* On Windows: 'C:\\Users\\YOUR-USER-NAME\\Documents\\YOUR-TEXT-FILE.txt'
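
If typing backslashes by hand feels error-prone, Python's `pathlib` module can build the same path for whichever operating system the notebook is running on. This is an optional sketch, not part of the original notebook. The file name is a placeholder, and the Documents folder is assumed to sit directly inside your home directory.

```python
from pathlib import Path

# Path.home() is your user folder, e.g. /Users/YOUR-USER-NAME on Mac
# or C:\Users\YOUR-USER-NAME on Windows
filepath = Path.home() / "Documents" / "YOUR-TEXT-FILE.txt"

# Path objects can be passed directly to open()
print(filepath)
```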

In [6]:
# #Put the full path to your file between the single quotes here
# filepath = r'C:\Users\Francisco\Desktop\Python\FinalPaper\Analysis\MG.txt.txt'

# #The outname is the name of the lemmatized file that this notebook creates
# #If you want it to be named something other than the original file name + -lemmatized
# #you can change that here
# outname = filepath.replace('.txt', '-lemmatized.txt')
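
One caveat about the `replace` call above: it swaps every `.txt` in the path, so a file named `MG.txt.txt` would end up as `MG-lemmatized.txt-lemmatized.txt`. If that matters for your files, the sketch below changes only the final extension. The function name `lemmatized_name` is made up for this example.

```python
import os

def lemmatized_name(path):
    """Insert '-lemmatized' before the final file extension only."""
    root, extension = os.path.splitext(path)
    return root + "-lemmatized" + extension

print(lemmatized_name("MG.txt"))      # MG-lemmatized.txt
print(lemmatized_name("MG.txt.txt"))  # MG.txt-lemmatized.txt
```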

In [7]:
# #Opens the file you specified
# with open(filepath, 'r', encoding='utf8') as f:
#     #Creates an empty text file with -lemmatized.txt appended to the name
#     with open(outname, 'w', encoding='utf8') as out:
#         #Reads the text of the file you specified
#         text = f.read()
#         #Removes any problematic punctuation
#         cleantext = clean_string(text)
#         #Does Portuguese NLP on the cleaned text
#         doc = ptnlp(cleantext)
#         #For each word in the text...
#         for token in doc:
#             #Write the lemma to the new text file with the lemmatized text
#             out.write(token.lemma_)
#             #Write a space after each word
#             out.write(' ')
#             #Print the lemmas to the screen below, with a space between them
#             print(token.lemma_, end=' ')

## Lemmatizing a folder of text files
Put the full path to your folder of text files in the cell below, using the correct syntax for your operating system. 

For instance, the default path to a folder called "portuguese" in the Documents directory is (substituting your user name on the computer for YOUR-USER-NAME):

* On Mac: '/Users/YOUR-USER-NAME/Documents/portuguese'
* On Windows: 'C:\\Users\\YOUR-USER-NAME\\Documents\\portuguese'

With a whole folder of texts, printing the full text to the screen can make your Jupyter notebook file get very big, so it's been "turned off" here. If you want to see the script at work, you can remove the `#` character at the start of `#print(token.lemma_, end=' ')` at the end of the last code cell below.

In [8]:
#Put the full path to your folder between single quotes here
textfolder = r'C:\Users\Francisco\Desktop\Python\FinalPaper\Analysis'
#Changes the working directory to the folder you specified
os.chdir(textfolder)

In [9]:
#Adds extra stop words; is_stop is checked one token at a time, so only the single-word entries (e.g. "governo") can match
ptnlp.Defaults.stop_words |= {"Rede Estadual", "estadual de educação", "Estadual de Educação", "ESTADUAL DE EDUCAÇÃO", "unidade curricular", "ensino médio", "rede estadual", "governo", "secretaria", "secretaria educação", "base nacional", "comum curricular"}
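
A note on the stop-word cell above: spaCy checks `is_stop` one token at a time, so single words such as "governo" can match, but multi-word entries such as "ensino médio" cannot. One possible workaround, sketched here with the standard library only, is to delete such phrases from the text before it goes to the lemmatizer. The names `PHRASES` and `remove_phrases` are hypothetical, and the phrase list is a shortened copy of the one above.

```python
import re

# Multi-word phrases to drop (shortened, illustrative list)
PHRASES = ["rede estadual", "estadual de educação", "ensino médio"]

def remove_phrases(text, phrases):
    """Delete each phrase (ignoring case) and tidy up the leftover spaces."""
    # Longest phrases first, so that at any position the longest one is tried first
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b",
        flags=re.IGNORECASE,
    )
    without = pattern.sub(" ", text)
    # Collapse runs of spaces left behind by the removed phrases
    return re.sub(r" {2,}", " ", without).strip()

sample = "A Rede Estadual oferece ENSINO MÉDIO em tempo integral."
print(remove_phrases(sample, PHRASES))
```

In the folder loop, a call like this would sit between `clean_string(text)` and `ptnlp(...)`.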

In [10]:
#For every file in the folder you specified...
for filename in os.listdir(textfolder):
    #If it's a text file, but not one of the text files with just lemmas
    if filename.endswith('.txt') and not filename.endswith('-lemmatized.txt'):
        #The outname is the name of the lemmatized file that this notebook creates
        #If you want it to be named something other than the original file name + -lemmatized
        #you can change that here
        outname = filename.replace('.txt', '-lemmatized.txt')
        #Opens the file you specified
        with open(filename, 'r', encoding='utf8') as f:
            #Creates an empty text file with -lemmatized.txt appended to the name
            with open(outname, 'w', encoding='utf8') as out:
                #Reads the text of the file you specified
                text = f.read()
                #Removes any problematic punctuation
                cleantext = clean_string(text)
                #Does Portuguese NLP on the cleaned text
                doc = ptnlp(cleantext)
                #For each word in the text...
                for token in doc:
                    #Skip stop words, punctuation, and anything that isn't purely alphabetic
                    if not token.is_stop and not token.is_punct and token.is_alpha:
                        #Write the lemma to the new text file with the lemmatized text
                        out.write(token.lemma_)
                        #Write a space after each word
                        out.write(' ')
                    #Print the lemmas to the screen below, with a space between them
                    #print(token.lemma_, end=' ')
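
The loop above depends on `os.chdir` having been run first, because `os.listdir` returns bare file names. As an optional alternative, the sketch below uses `pathlib` to collect full paths to the input files without changing the working directory. `find_input_files` is a hypothetical helper, and the demonstration runs on a temporary folder so it does not touch your own files.

```python
import tempfile
from pathlib import Path

def find_input_files(folder):
    """Return .txt files in `folder`, skipping already-lemmatized output."""
    return sorted(
        path for path in Path(folder).glob("*.txt")
        if not path.name.endswith("-lemmatized.txt")
    )

# Demonstration on a throwaway folder with made-up file names
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ["a.txt", "a-lemmatized.txt", "b.txt", "notes.md"]:
        (Path(demo_folder) / name).write_text("exemplo", encoding="utf8")
    found = [path.name for path in find_input_files(demo_folder)]
    print(found)
```

Because each result is a full path, it can be handed to `open()` from any working directory.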

## About

This Jupyter notebook was originally developed by Quinn Dombrowski for use in [DLCL 204: Digital Humanities Across Borders](https://github.com/quinnanya/dlcl204) at Stanford University, fall 2020. 