# Open text files with encoding handling

Nautilus includes a document loader that handles encoding detection and can load several files at once.

## Create example files 

In [1]:
# Example of a latin1 file
s = "J'aime les frites bien grasse étalon châpeau!"
encoded_s = s.encode('latin-1')
with open('somefile.txt', 'wb') as f:
    f.write(encoded_s)

In [2]:
# Let's add another document 
s = "Un deuxième exemple de texte en utf-8 cette fois!"
encoded_s = s.encode('utf-8')
with open('someadditionalfile.txt', 'wb') as f:
    f.write(encoded_s)
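
The two example files differ only in how accented characters are stored on disk. As a small illustration (the variable names below are ours, not part of Nautilus), the same character produces different bytes in each encoding, and decoding with the wrong one either fails or yields garbled text:

```python
s = "é"

# Latin-1 stores "é" as a single byte, UTF-8 as two bytes.
latin1_bytes = s.encode("latin-1")
utf8_bytes = s.encode("utf-8")
print(latin1_bytes, utf8_bytes)

# Reading UTF-8 bytes as Latin-1 never fails, but the text comes out garbled.
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)

# Reading Latin-1 bytes as UTF-8 fails outright.
try:
    latin1_bytes.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
print(utf8_failed)
```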

# Document loader

In [3]:
from nautilus_nlp.utils.file_loader import documents_loader

## Open a file when you know the encoding

In [4]:
# default encoding is UTF-8
documents_loader('someadditionalfile.txt')

'Un deuxième exemple de texte en utf-8 cette fois!'

In [5]:
# If you know the encoding, you can specify it
documents_loader('somefile.txt', encoding='latin-1')

"J'aime les frites bien grasse étalon châpeau!"

## Open a file with encoding detection

If you don't specify an encoding, `documents_loader()` first tries to open the file as UTF-8. If that fails, it tries to detect the encoding.

In [6]:
documents_loader('somefile.txt')

INFO:root:somefile.txt: detected encoding is ISO-8859-1, with a confidence rate of 0.73


"J'aime les frites bien grasse étalon châpeau!"

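The fallback described in this section can be sketched as follows. This is a minimal illustration, not the actual Nautilus implementation: `load_with_fallback` and its `detector` parameter are hypothetical names, and we assume detection is delegated to `chardet.detect`, since the loader is described as being based on Chardet.

```python
import logging


def load_with_fallback(path, encoding="utf-8", detector=None):
    """Hypothetical helper: try `encoding` first, then fall back to detection."""
    with open(path, "rb") as f:
        raw = f.read()
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        if detector is None:
            # Assumption: chardet (third-party) is installed; imported lazily.
            import chardet
            detector = chardet.detect
        # The detector returns a dict with 'encoding' and 'confidence' keys.
        guess = detector(raw)
        logging.info(
            "%s: detected encoding is %s, with a confidence rate of %s",
            path, guess["encoding"], guess["confidence"],
        )
        return raw.decode(guess["encoding"])
```

Passing a custom `detector` makes it possible to swap in another detection library or a fixed guess without changing the rest of the function.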
In [19]:
# You can prevent the document loader from detecting the encoding when UTF-8 decoding fails.
# In that case it is meant to raise a UnicodeDecodeError (the recorded run below ended in a TypeError).
documents_loader('somefile.txt', detectencoding=False)



TypeError: function takes exactly 5 arguments (1 given)
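
The `TypeError` message above matches what Python reports when the built-in `UnicodeDecodeError` is constructed with a single argument instead of the five it requires. Whether this is what happens inside `documents_loader` is an assumption on our part; the sketch below only demonstrates the standard Python behaviour.

```python
raw = "étalon".encode("latin-1")

# Decoding Latin-1 bytes as UTF-8 raises a genuine UnicodeDecodeError.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    decode_error = err
    print(type(err).__name__, err.encoding, err.start, err.end)

# Building the exception by hand with only a message raises a TypeError instead.
try:
    raise UnicodeDecodeError("could not decode file")
except TypeError as err:
    constructor_error = err
    print(type(err).__name__)

# The constructor expects: encoding, object, start, end, reason.
explicit = UnicodeDecodeError("utf-8", raw, 0, 1, "could not decode file")
print(explicit.reason)
```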

## Open several files

In [9]:
# you can use wildcards to open several documents
documents_loader('*.txt')

INFO:root:somefile.txt: detected encoding is ISO-8859-1, with a confidence rate of 0.73


{'somefile.txt': "J'aime les frites bien grasse étalon châpeau!",
 'someadditionalfile.txt': 'Un deuxième exemple de texte en utf-8 cette fois!'}

In [10]:
# you can also pass a list of filepaths
documents_loader(['somefile.txt','someadditionalfile.txt'])

INFO:root:somefile.txt: detected encoding is ISO-8859-1, with a confidence rate of 0.73


{'somefile.txt': "J'aime les frites bien grasse étalon châpeau!",
 'someadditionalfile.txt': 'Un deuxième exemple de texte en utf-8 cette fois!'}

In [11]:
# you can specify the output format when you load multiple texts
documents_loader('*.txt', output_as='list')

INFO:root:somefile.txt: detected encoding is ISO-8859-1, with a confidence rate of 0.73


["J'aime les frites bien grasse étalon châpeau!",
 'Un deuxième exemple de texte en utf-8 cette fois!']
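
The multi-file behaviour shown in this section (a wildcard or a list of paths in, a dict or a list out) can be approximated with the standard library. This is a rough sketch under our own assumptions: `load_many` is a hypothetical name, it reads every file with a single encoding, and it does not reproduce the detection step.

```python
import glob


def load_many(paths, encoding="utf-8", output_as="dict"):
    """Hypothetical helper: accept a wildcard pattern or a list of file paths."""
    if isinstance(paths, str):
        # Expand the wildcard; sorted() makes the order predictable.
        paths = sorted(glob.glob(paths))
    texts = {}
    for path in paths:
        with open(path, encoding=encoding) as f:
            texts[path] = f.read()
    if output_as == "dict":
        return texts
    if output_as == "list":
        return list(texts.values())
    raise ValueError("output_as must be 'dict' or 'list'")
```

Keying the dict by file path keeps each text traceable to its source, while the list form is convenient when the texts go straight into a preprocessing pipeline.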

## List files in a folder

In [12]:
from nautilus_nlp.utils.file_loader import list_files

In [13]:
list_files('.')  # list files in the current folder

['./Visualization tools.ipynb',
 './Sentiment analysis using pre-trained models.ipynb',
 './somefile.txt',
 './Language_identification.ipynb',
 './1. Text Preprocessing.ipynb',
 './3. Text Vectorization - TF-IDF.ipynb',
 './TopicModeling.ipynb',
 './Sentiment_analysis_FT.ipynb',
 './Benchmark text processing tools.ipynb',
 './0. Text file loader.ipynb',
 './someadditionalfile.txt',
 './2. Text processing.ipynb',
 './Spacy_model.ipynb']

In [14]:
list_files('../tests/testfolder_fileloader/.')

[]

In [15]:
list_files('./*.ipynb') # List files matching specific pattern

['./Visualization tools.ipynb',
 './Sentiment analysis using pre-trained models.ipynb',
 './Language_identification.ipynb',
 './1. Text Preprocessing.ipynb',
 './3. Text Vectorization - TF-IDF.ipynb',
 './TopicModeling.ipynb',
 './Sentiment_analysis_FT.ipynb',
 './Benchmark text processing tools.ipynb',
 './0. Text file loader.ipynb',
 './2. Text processing.ipynb',
 './Spacy_model.ipynb']

In [16]:
# Only files are returned, not folders
list_files('/Users/hugo/Documents/NAUTILUS/nautilus-nlp/')

[]
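
For readers who want comparable behaviour without Nautilus, listing only files (and not folders) for a folder or a wildcard pattern can be sketched with `glob` and `os.path`. The function name `list_files_sketch` is hypothetical, and this version may differ from `list_files` in details such as hidden files or how the paths are formatted.

```python
import glob
import os


def list_files_sketch(path):
    """Hypothetical helper: list files in a folder, or files matching a pattern."""
    # A folder is turned into a "folder/*" pattern; anything else is used as-is.
    pattern = os.path.join(path, "*") if os.path.isdir(path) else path
    # glob returns files and folders alike, so keep regular files only.
    return [p for p in glob.glob(pattern) if os.path.isfile(p)]
```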

## Detect encoding 

If you are only interested in detecting the encoding, you can use this function, which is based on the Chardet library.

In [17]:
from nautilus_nlp.utils.file_loader import detect_encoding

In [18]:
detect_encoding('somefile.txt')

{'encoding': 'ISO-8859-1', 'confidence': 0.73, 'language': ''}