# DSL Text Analysis
- **Created by: Andrés Segura Tinoco**
- **Created on: Aug 20, 2020**
- **Data: Dictionary of the Spanish language**

#### Descriptive Analysis:
1. Approximate number of words in the DSL
2. Number of words with an acute accent in the Spanish language
3. Top 5 longest words
4. Frequency of words by length
5. Frequency of words per letter of the alphabet
6. Frequency of letters in DSL words

In [1]:
# Load Python libraries
import re
import io
import codecs
from collections import Counter

## 0. Load words from Dictionary of the Spanish language

In [2]:
# Util function - Read a plain text file
def read_file_lines(file_path):
    lines = []
    
    with codecs.open(file_path, encoding='utf-8') as f:
        for line in f:
            lines.append(line)
    
    return lines
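
A quick way to see what `read_file_lines` returns is to run it on a throwaway file. The sketch below assumes a hypothetical two-line sample (the real files under `../data/dics/` are not reproduced here) and uses the built-in `open`, which accepts `encoding` directly in Python 3. It is an equivalent variant for illustration, not a replacement for the cell above.

```python
import os
import tempfile

def read_file_lines(file_path):
    # Built-in open() handles the encoding in Python 3
    with open(file_path, encoding='utf-8') as f:
        return list(f)

# Hypothetical sample content, not taken from the real dictionary files
sample = 'abad, desa\n\u00e1baco\n'

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, 'a.txt')
    with open(sample_path, 'w', encoding='utf-8') as f:
        f.write(sample)
    sample_lines = read_file_lines(sample_path)

print(sample_lines)
```

Each returned line keeps its trailing newline, which is why the data-quality step removes `'\n'` before doing anything else.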

In [3]:
# Util function - Data quality
def apply_dq_word(word):
    new_word = word.replace('\n', '')
    
    # Get first token
    if ',' in new_word:
        new_word = new_word.split(',')[0]
    
    # Remove extra whitespaces
    new_word = new_word.strip()
    
    # Remove digits by trimming characters from the end until none remain
    while re.search(r"\d", new_word):
        new_word = new_word[:-1]
        
    return new_word
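
To make the cleaning rules concrete, here is a minimal sketch that applies the same logic to a few raw entries. The sample strings are hypothetical and only assume that entries may carry a comma-separated variant, surrounding whitespace, or a numeric suffix.

```python
import re

def apply_dq_word(word):
    new_word = word.replace('\n', '')

    # Keep only the text before the first comma
    if ',' in new_word:
        new_word = new_word.split(',')[0]

    new_word = new_word.strip()

    # Trim characters from the end until no digit remains
    while re.search(r"\d", new_word):
        new_word = new_word[:-1]

    return new_word

# Hypothetical raw entries, not taken from the real dictionary files
samples = ['abad, desa\n', 'haz2\n', '  casa \n', 'a1b\n']
cleaned = [apply_dq_word(s) for s in samples]
print(cleaned)
```

The last sample shows a side effect of trimming from the end: a digit in the middle of an entry also removes everything after it, so `'a1b'` becomes `'a'`. This is harmless if digits only ever appear as a suffix, which is an assumption about the data.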

In [4]:
# One file per letter: a-z plus ñ
letters = list(map(chr, range(97, 123)))
letters.append('ñ')
len(letters)

27

## 1. Approximate number of words in the DSL

In [5]:
# Read words by letter [a-z, ñ]
word_dict = Counter()
file_path = '../data/dics/'

for letter in letters:
    filename = file_path + letter + '.txt'
    word_list = read_file_lines(filename)
    
    for word in word_list:
        word = apply_dq_word(word)
        word_dict[word] += 1

# Show results
n_words = len(word_dict)
print('Total of different words: %d' % n_words)

Total of different words: 88192
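
The count above is the number of distinct keys in the `Counter`, not the number of lines read. The following sketch illustrates the difference on a hypothetical list of already-cleaned entries, where one spelling appears twice (as could happen if numbered entries collapse to the same word after digit removal).

```python
from collections import Counter

# Hypothetical cleaned entries, not taken from the real dictionary files
cleaned_words = ['abad', 'haz', 'haz', 'casa']

word_dict = Counter(cleaned_words)

n_words = len(word_dict)               # distinct spellings
n_entries = sum(word_dict.values())    # all entries, including repeats
repeated = [w for w, c in word_dict.items() if c > 1]

print('Distinct: %d, entries: %d, repeated: %s' % (n_words, n_entries, repeated))
```

A blank line in an input file would be cleaned to the empty string and counted as one more key, so checking `'' in word_dict` is a cheap sanity test before reporting the total.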


## 2. Number of words with an acute accent in the Spanish language

In [6]:
# Counting words with acute accent
count = 0
regexp = re.compile('[áéíóúÁÉÍÓÚ]')

for word in word_dict.keys():
    if regexp.search(word):
        count += 1

# Show results
print('Total of accented words: %d (%0.2f%%)' % (count, (100.0 * count / n_words)))

Total of accented words: 16334 (18.52%)
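
One caveat with a character-class regex is that it only matches precomposed accented letters. If any word were stored in decomposed form (a base vowel followed by the combining acute accent U+0301), it would be missed. The sketch below is a hedged variant, not the method used above: it normalizes each word to NFC before matching. The helper name `has_acute_accent` and the sample words are hypothetical.

```python
import re
import unicodedata

# Normalize the pattern too, so the class holds precomposed characters
accent_re = re.compile(unicodedata.normalize('NFC', '[áéíóúÁÉÍÓÚ]'))

def has_acute_accent(word):
    # NFC composes 'o' + U+0301 into the single character U+00F3
    return bool(accent_re.search(unicodedata.normalize('NFC', word)))

# Hypothetical sample: precomposed, decomposed, unaccented, accented, diaeresis
words = ['canci\u00f3n', 'cancio\u0301n', 'casa', '\u00e1rbol', 'ping\u00fcino']

count = sum(1 for w in words if has_acute_accent(w))
share = 100.0 * count / len(words)
print('Accented: %d of %d (%0.2f%%)' % (count, len(words), share))
```

Whether the dictionary files actually contain decomposed characters is not known from this notebook; if they are all NFC already, the normalization step changes nothing.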

