# Language detector

## Objective

Devise a strategy and build a system that can identify which language an arbitrary sentence is written in.

For example:

`eu gosto de café` → `por`

`I like coffee` → `eng`

## Libraries

In [66]:
# data manipulation
import pandas as pd

# model
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

## Dataset

This time I found a dataset of sentences labeled with their language: Afrikaans, English, and Nederlands (Dutch). I wanted one that included Portuguese, but could not find any.

It is available in the following GitHub repository: https://github.com/rolandgem/LanguageClassification/tree/main

## Reading the data

In [4]:
df = pd.read_csv('lang_data.csv')
df.head()

Unnamed: 0,text,language
0,Ship shape and Bristol fashion,English
1,Know the ropes,English
2,Graveyard shift,English
3,Milk of human kindness,English
4,Touch with a barge-pole - Wouldn't,English


In [5]:
df.shape

(2839, 2)

In [8]:
df.duplicated().sum()

84

In [16]:
df.drop_duplicates(inplace=True)
df.duplicated().sum()

0

In [17]:
df.isnull().sum()

text        3
language    0
dtype: int64

In [18]:
# Reassign instead of using inplace=True, which avoids pandas' SettingWithCopyWarning
df = df.dropna()
df.isnull().sum()


text        0
language    0
dtype: int64

In [19]:
df.shape

(2752, 2)
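
The row counts above are consistent: 2,839 rows minus 84 duplicates minus 3 rows with missing text leaves 2,752. As a minimal sketch, the same cleaning can be written as one chain that returns a new DataFrame at each step. The toy rows below are hypothetical stand-ins for `lang_data.csv`.

```python
import pandas as pd

# Hypothetical toy rows standing in for lang_data.csv.
toy = pd.DataFrame({
    "text": ["Know the ropes", "Know the ropes", None, "Schijn bedriegt."],
    "language": ["English", "English", "English", "Nederlands"],
})

# Each method returns a new DataFrame, so nothing is modified in place.
clean = toy.drop_duplicates().dropna(subset=["text"])

print(clean.shape)  # (2, 2)
```

The original index labels are preserved, which matters later in the notebook because the bag-of-words table is joined back on the index.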

In [20]:
percentual_lingua = df['language'].value_counts(normalize=True) * 100
percentual_lingua


English       74.382267
Afrikaans     23.183140
Nederlands     2.434593
Name: language, dtype: float64

**There are very few Dutch samples: about 2.4% of the data, or 67 of 2,752 sentences.**

## Text preparation

### Stop Words

Here we need to split the data by language and remove the stop words of each of the 3 languages using a separate list for each one.

In [41]:
df1 = df.copy()

df_africano = df1[df1['language'] == 'Afrikaans']
df_ingles = df1[df1['language'] == 'English']
df_holandes = df1[df1['language'] == 'Nederlands']

I found the stop word lists in this GitHub repository: https://github.com/6/stopwords-json?tab=readme-ov-file

I tried the nltk package first, but as far as I can tell it has no Afrikaans list (it does ship a Dutch one), so I decided to use a source that has all three languages in one place.

In [42]:
# Starting with Afrikaans (assumes one stop word per line in the file)
with open('stop_words/afrikaans_stopwords.txt', 'r', encoding='utf-8') as arquivo:
    stopwords_lista = set(arquivo.read().splitlines())

# Remove the stop words from a text (case-insensitive comparison)
def remover_stopwords(texto):
    palavras = texto.split()
    palavras_sem_stopwords = [palavra for palavra in palavras if palavra.lower() not in stopwords_lista]
    return ' '.join(palavras_sem_stopwords)


# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_africano = df_africano.assign(text=df_africano['text'].apply(remover_stopwords))

df_africano.head()


Unnamed: 0,text,language
5,Sy kan altyd my battery natpiepie.,Afrikaans
16,Piet Pompies van Soetmelksvlei is nie van hier...,Afrikaans
28,Die lewe is 10 % wat met ons gebeur en 90 % on...,Afrikaans
31,"Saam met ons, rondom ons, in ons en by ons is ...",Afrikaans
35,"Vroeg uit die bed, maak die beursie vet!",Afrikaans


In [43]:
# English (assumes one stop word per line in the file)
with open('stop_words/english_stopwords.txt', 'r', encoding='utf-8') as arquivo:
    stopwords_lista = set(arquivo.read().splitlines())

# Remove the stop words from a text (case-insensitive comparison)
def remover_stopwords(texto):
    palavras = texto.split()
    palavras_sem_stopwords = [palavra for palavra in palavras if palavra.lower() not in stopwords_lista]
    return ' '.join(palavras_sem_stopwords)


# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_ingles = df_ingles.assign(text=df_ingles['text'].apply(remover_stopwords))

df_ingles.head()


Unnamed: 0,text,language
0,Ship shape and Bristol fashion,English
1,Know the ropes,English
2,Graveyard shift,English
3,Milk of human kindness,English
4,Touch with a barge-pole - Wouldn't,English


In [44]:
# Dutch (assumes one stop word per line in the file)
with open('stop_words/nederlands_stopwords.txt', 'r', encoding='utf-8') as arquivo:
    stopwords_lista = set(arquivo.read().splitlines())

# Remove the stop words from a text (case-insensitive comparison)
def remover_stopwords(texto):
    palavras = texto.split()
    palavras_sem_stopwords = [palavra for palavra in palavras if palavra.lower() not in stopwords_lista]
    return ' '.join(palavras_sem_stopwords)


# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_holandes = df_holandes.assign(text=df_holandes['text'].apply(remover_stopwords))

df_holandes.head()


Unnamed: 0,text,language
64,"Je moet geen oude schoenen weggooien, voordat ...",Nederlands
93,"Wie zaait, zal oogsten.",Nederlands
119,"Wie het eerst komt, het eerst maalt.",Nederlands
143,Gedane zaken nemen geen keer.,Nederlands
170,Aan alles komt een eind.,Nederlands
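
The `head()` outputs above still contain very common words such as "and", "of", "die" and "het", which suggests the removal had little or no effect. I have not confirmed the cause. Two plausible ones are that the files hold a JSON array on a single line (the source repository is named stopwords-json) rather than one word per line, and that punctuation stuck to a token prevents a match. The sketch below handles both cases. `parse_stopwords` and `remove_stopwords` are hypothetical helper names, not part of the notebook.

```python
import json
import re

def parse_stopwords(raw):
    """Parse stop words from either a JSON array or one-word-per-line text."""
    raw = raw.strip()
    if raw.startswith("["):
        words = json.loads(raw)      # e.g. '["and", "of"]'
    else:
        words = raw.splitlines()     # e.g. 'and\nof\n'
    return {w.strip().lower() for w in words if w.strip()}

def remove_stopwords(text, stopwords):
    """Drop tokens whose lowercase form, stripped of outer punctuation, is a stop word."""
    kept = []
    for token in text.split():
        core = re.sub(r"^\W+|\W+$", "", token).lower()
        if core not in stopwords:
            kept.append(token)
    return " ".join(kept)

# Tiny illustrative lists; the real lists would come from the files.
english = parse_stopwords('["and", "of", "the"]')
afrikaans = parse_stopwords("die\nis\nwat\n")

print(remove_stopwords("Ship shape and Bristol fashion", english))  # Ship shape Bristol fashion
print(remove_stopwords("Milk of human kindness", english))          # Milk human kindness
```

Passing the stop word set as an argument also removes the need to redefine the function once per language.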


### Joining the DataFrames back together

In [46]:
df_stop = pd.concat([df_africano, df_ingles, df_holandes])
df_stop

Unnamed: 0,text,language
5,Sy kan altyd my battery natpiepie.,Afrikaans
16,Piet Pompies van Soetmelksvlei is nie van hier...,Afrikaans
28,Die lewe is 10 % wat met ons gebeur en 90 % on...,Afrikaans
31,"Saam met ons, rondom ons, in ons en by ons is ...",Afrikaans
35,"Vroeg uit die bed, maak die beursie vet!",Afrikaans
...,...,...
2516,De laatste druppel doet de emmer overlopen.,Nederlands
2751,Kleine potjes hebben grote oren.,Nederlands
2776,Schijn bedriegt.,Nederlands
2797,Iets is beter dan niets.,Nederlands


## Bag-of-words representation

In [48]:
countvec = CountVectorizer(ngram_range = (1,2))
countvec_data = countvec.fit_transform(df_stop['text'])

In [52]:
# Convert the sparse matrix to a DataFrame
df_bow = pd.DataFrame(countvec_data.toarray(), 
                      columns=countvec.get_feature_names_out())
df_bow.index = df_stop.index
df_bow = df_bow.join(df_stop[['language']], how='left')
df_bow.head()


Unnamed: 0,10,10 wat,1500s,1500s folk,1ste,1ste graad,22,90,90 ons,99,...,zwijgen is,àugur,àugur well,één,één ding,één klap,één paard,één rede,één vogel,language
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Afrikaans
16,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Afrikaans
28,1,1,0,0,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,Afrikaans
31,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Afrikaans
35,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,Afrikaans
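
The dense table is convenient for inspection, but with 2,752 rows and 13,375 n-gram columns it holds about 36.8 million cells, almost all of them zero. As a hedged alternative, the sparse matrix returned by `fit_transform` can be passed to scikit-learn directly. The sketch below uses a few toy sentences and a small `n_estimators` only to keep it quick.

```python
from scipy.sparse import issparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import GradientBoostingClassifier

# Toy sentences from the dataset; labels follow the notebook's coding (1 = English, 2 = Nederlands).
texts = ["know the ropes", "graveyard shift", "wie zaait zal oogsten", "schijn bedriegt"]
labels = [1, 1, 2, 2]

X = CountVectorizer(ngram_range=(1, 2)).fit_transform(texts)
print(issparse(X), X.shape)  # True (4, 18)

# The classifier accepts the sparse matrix as is; no toarray() needed.
model = GradientBoostingClassifier(n_estimators=5, random_state=42)
model.fit(X, labels)
predictions = model.predict(X)
```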


# Building the model

In [55]:
language_map = {'Afrikaans': 0, 'English':1, 'Nederlands':2}
df_bow['language'] = df_bow['language'].map(language_map)
df_bow.head()

Unnamed: 0,10,10 wat,1500s,1500s folk,1ste,1ste graad,22,90,90 ons,99,...,zwijgen is,àugur,àugur well,één,één ding,één klap,één paard,één rede,één vogel,language
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
16,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
28,1,1,0,0,0,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
31,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
35,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
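
Since the model will output these numeric codes, it helps to keep the reverse mapping around to turn predictions back into language names. A minimal sketch, where `inverse_map` and `predicted_codes` are illustrative names:

```python
language_map = {'Afrikaans': 0, 'English': 1, 'Nederlands': 2}

# Swap keys and values to decode predictions.
inverse_map = {code: name for name, code in language_map.items()}

predicted_codes = [1, 0, 2]
decoded = [inverse_map[code] for code in predicted_codes]
print(decoded)  # ['English', 'Afrikaans', 'Nederlands']
```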


## Defining the target and features

In [56]:
# Target 
y = df_bow['language']

# Features
x = df_bow.drop('language', axis=1)

## Train/test split

In [59]:
x_train, x_test, y_train, y_test = train_test_split(x, 
                                                    y, 
                                                    test_size=0.30, 
                                                    random_state=42)

In [60]:
(x_train.shape, x_test.shape, y_train.shape, y_test.shape)

((1926, 13375), (826, 13375), (1926,), (826,))
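
The split above is purely random. With a class as small as Nederlands, a stratified split is one option worth considering, since it keeps the class proportions the same in the training and test sets. The sketch below uses hypothetical toy labels (70/20/10) rather than the real data.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy labels.
y = ["English"] * 70 + ["Afrikaans"] * 20 + ["Nederlands"] * 10
X = [f"sentence {i}" for i in range(len(y))]

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.30,
    random_state=42,
    stratify=y,  # preserve class proportions in both parts
)

print(Counter(y_test))
# Counter({'English': 21, 'Afrikaans': 6, 'Nederlands': 3})
```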

## Creating and training the model

In [63]:
modelo = GradientBoostingClassifier(n_estimators=100,
                                    subsample=0.5,
                                    random_state=42)
modelo.fit(x_train, y_train)
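
Because the classes are so unbalanced, one possible adjustment (not something the notebook does) is to weight each training sample inversely to its class frequency. `GradientBoostingClassifier.fit` accepts a `sample_weight` argument, and scikit-learn's `compute_sample_weight` can produce "balanced" weights. The data below is a hypothetical toy set.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Hypothetical toy data: 6 English (1), 3 Afrikaans (0), 1 Nederlands (2).
y_toy = np.array([1] * 6 + [0] * 3 + [2] * 1)
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(10, 5))

# "balanced" gives each sample the weight n_samples / (n_classes * class_count).
weights = compute_sample_weight(class_weight="balanced", y=y_toy)
print(weights[0], weights[6], weights[9])  # about 0.556, 1.111, 3.333

model = GradientBoostingClassifier(n_estimators=5, random_state=42)
model.fit(X_toy, y_toy, sample_weight=weights)
```

Whether this helps on the real data would need to be checked against the confusion matrix.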

In [64]:
y_pred = modelo.predict(x_test)
y_pred

array([1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 2, 1,
       0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 1,
       1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,
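
To answer the question from the objective for a brand-new sentence, the sentence has to go through the same vectorizer that was fitted on the training data. One way to keep the two steps together is a scikit-learn pipeline. This is a sketch on a handful of sentences from the dataset, with a reduced `n_estimators`, and not the notebook's actual training run.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

train_texts = [
    "Ship shape and Bristol fashion", "Know the ropes",
    "Graveyard shift", "Milk of human kindness",
    "Sy kan altyd my battery natpiepie.", "Vroeg uit die bed, maak die beursie vet!",
    "Wie zaait, zal oogsten.", "Gedane zaken nemen geen keer.",
    "Aan alles komt een eind.", "Schijn bedriegt.",
]
train_labels = ["English"] * 4 + ["Afrikaans"] * 2 + ["Nederlands"] * 4

# The pipeline fits the vectorizer on the training texts only,
# then reuses that vocabulary whenever predict() is called.
pipeline = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    GradientBoostingClassifier(n_estimators=20, random_state=42),
)
pipeline.fit(train_texts, train_labels)

pred = pipeline.predict(["Know the ropes"])
print(pred)
```

Fitting the vectorizer inside the pipeline also means that, after a train/test split on the raw text, the test sentences do not contribute to the vocabulary.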

In [65]:
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.9297820823244553
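
An accuracy of about 0.93 is easier to judge next to a trivial baseline. According to the rows of the confusion matrix in this notebook, the test set has 193 Afrikaans, 611 English and 22 Nederlands sentences, so always answering "English" would already be right about 74% of the time. The sketch below reproduces that figure with `DummyClassifier`. For simplicity it fits the baseline on the test labels themselves; normally it would be fitted on the training labels, where English is also the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Test-set class counts read off the rows of the confusion matrix.
y_true = np.array([0] * 193 + [1] * 611 + [2] * 22)
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by the baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
baseline_acc = accuracy_score(y_true, baseline.predict(X_dummy))
print(round(baseline_acc, 4))  # 0.7397
```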


In [68]:
confusion_matrix(y_test, y_pred)

array([[153,  40,   0],
       [  1, 610,   0],
       [  0,  17,   5]], dtype=int64)
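
In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted ones, in the order 0 = Afrikaans, 1 = English, 2 = Nederlands. Per-class precision and recall can be read directly from it, and they show what the overall accuracy hides: only 5 of the 22 Dutch sentences were recognized, and 40 Afrikaans sentences were labeled as English.

```python
import numpy as np

# Confusion matrix copied from the output above.
cm = np.array([[153,  40, 0],
               [  1, 610, 0],
               [  0,  17, 5]])
labels = ["Afrikaans", "English", "Nederlands"]

recall = cm.diagonal() / cm.sum(axis=1)     # correct / all true samples of the class
precision = cm.diagonal() / cm.sum(axis=0)  # correct / all predictions of the class

for name, p, r in zip(labels, precision, recall):
    print(f"{name}: precision={p:.3f} recall={r:.3f}")
# Afrikaans: precision=0.994 recall=0.793
# English: precision=0.915 recall=0.998
# Nederlands: precision=1.000 recall=0.227
```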