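#### [Machine Learning] Language Identification Model

#### Linguistics
- Languages written with an alphabet differ in how often each letter is used.
- A language can therefore be identified from its letter-frequency profile.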
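
As a rough illustration of the idea above, the sketch below counts how often each of the 26 lowercase letters a-z occurs in a text and normalizes the counts into relative frequencies. The function name `letter_frequencies` and the sample sentences are made up for this example and are not part of the dataset.

```python
from collections import Counter
import string


def letter_frequencies(text):
    """Return the relative frequency of each letter a-z in `text` (hypothetical helper)."""
    # Keep only the 26 unaccented lowercase letters; everything else is ignored.
    letters = [ch for ch in text.lower() if ch in string.ascii_lowercase]
    counts = Counter(letters)
    total = len(letters)
    # Guard against empty input to avoid dividing by zero.
    return {ch: (counts[ch] / total if total else 0.0) for ch in string.ascii_lowercase}


# Made-up sample sentences, only to show that the profiles differ.
en_freq = letter_frequencies("The school is located on the campus of the museum.")
id_freq = letter_frequencies("Sekolah itu berada di dalam kawasan museum.")
print(round(en_freq["e"], 3), round(id_freq["e"], 3))
print(round(en_freq["a"], 3), round(id_freq["a"], 3))
```

Each profile is a 26-value vector, so two texts can be compared letter by letter or passed to a classifier as features.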

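#### Dataset

- Languages: English (en), French (fr), Indonesian (id), Tagalog (tl)
- lang.zip

- train => language code-number.txt (en-1.txt to en-5.txt)

- test  => language code-number.txt (en-1.txt to en-2.txt)

- How should the data be loaded, and how should it be processed?
- A web search turned up two approaches:
    - The first classifies languages by the frequency of lowercase letters --> characters that carry diacritical marks cannot be handled.
    - The second learns and classifies languages in whitespace-delimited word units --> the open question is how to handle special symbols and digits during preprocessing.
- If the preprocessing can be worked out, I would like to use the second approach.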
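
One possible way to handle the special symbols and digits raised by the second approach is to treat every non-letter character as a separator. The sketch below is only an assumption about how that could look; `tokenize_words` is a hypothetical helper, and it relies on `str.isalpha()`, which is also true for accented letters such as é.

```python
def tokenize_words(text):
    """Split `text` into lowercase words made of letters only (hypothetical helper)."""
    # Replace every character that is not a letter (digits, punctuation, symbols)
    # with a space; str.isalpha() is True for accented letters as well.
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    return cleaned.split()


print(tokenize_words("Enrollment totaled 467 in 2010.[1]"))
print(tokenize_words("L'école est située à Paris (France)."))
```

Letters from other scripts also count as letters under `str.isalpha()`, so they would still need separate handling if they should be excluded.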

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
train_data_path = '../data/Language/train/*.txt'
test_data_path = '../data/Language/test/*.txt'
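
The two glob patterns can be expanded with the standard `glob` module, and the language label can be read from the part of each file name before the hyphen. The following is a minimal sketch under that naming assumption; `list_language_files` is a hypothetical helper, and the demo writes a few empty files to a temporary directory rather than touching the real dataset.

```python
import glob
import os
import tempfile


def list_language_files(pattern):
    """Return sorted (language_code, path) pairs for files matching `pattern`.

    Assumes file names look like '<language code>-<number>.txt'.
    """
    pairs = []
    for path in glob.glob(pattern):
        file_name = os.path.basename(path)
        language_code = file_name.split("-")[0]
        pairs.append((language_code, path))
    return sorted(pairs)


# Demo on a temporary directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp_dir:
    for file_name in ["en-1.txt", "fr-6.txt", "id-11.txt", "tl-16.txt"]:
        open(os.path.join(tmp_dir, file_name), "w", encoding="utf-8").close()
    found = list_language_files(os.path.join(tmp_dir, "*.txt"))
    labels = [language_code for language_code, _ in found]
    print(labels)
```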

In [3]:
with open('../data/Language/train/en-1.txt', 'r', encoding='utf-8') as f:
    text = f.read()
    

In [4]:
text

'\n\n\n\nThe main Henry Ford Museum building houses some of the classrooms for the Henry Ford Academy\n\n\nHenry Ford Academy is the first charter school in the United States to be developed jointly by a global corporation, public education, and a major nonprofit cultural institution. The school is sponsored by the Ford Motor Company, Wayne County Regional Educational Service Agency and The Henry Ford Museum and admits high school students. It is located in Dearborn, Michigan on the campus of the Henry Ford museum. Enrollment is taken from a lottery in the area and totaled 467 in 2010.[1]\nFreshman meet inside the main museum building in glass walled classrooms, while older students use a converted carousel building and Pullman cars on a siding of the Greenfield Village railroad. Classes are expected to include use of the museum artifacts, a tradition of the original Village Schools. When the Museum was established in 1929, it included a school which served grades kindergarten to colle

In [5]:
pd.Series(text.split()).value_counts()

the            20
Academy        18
Ford           15
Detroit        15
in             15
               ..
website         1
links[edit]     1
External        1
2010]           1
it.             1
Name: count, Length: 451, dtype: int64

In [6]:
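text_list = []
with open('../data/Language/train/en-1.txt', 'r', encoding='utf-8') as f:
    for row in f:
        # Note: `row` already holds one line and f.readline() consumes the next one,
        # so only every second line of the file is appended here.
        text_list.append(f.readline().split('\n')[0])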
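
Because the loop above both iterates over `f` and calls `f.readline()`, it keeps only every second line. If every line is wanted, one option is to use the loop variable itself. The sketch below compares the two on a small temporary file; the file contents are invented for the demonstration.

```python
import os
import tempfile

# Write a small throwaway file with four numbered lines.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("line 1\nline 2\nline 3\nline 4\n")

    # Pattern used in the cell above: iteration and readline() each consume a line.
    skipping = []
    with open(path, "r", encoding="utf-8") as f:
        for row in f:
            skipping.append(f.readline().split("\n")[0])

    # Alternative: take the line from the loop variable.
    complete = []
    with open(path, "r", encoding="utf-8") as f:
        for row in f:
            complete.append(row.rstrip("\n"))

print(skipping)
print(complete)
```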

In [7]:
train_en_1_df = pd.DataFrame(text_list, columns=['train_en_1'])
train_en_1_df

                                            train_en_1
0
1
2
3    Henry Ford Academy is the first charter school...
4    The Henry Ford Learning Institute is using the...
..                                                 ...
287                                                  t
288
289
290


In [8]:
mask = (train_en_1_df['train_en_1'] == 't') | (train_en_1_df['train_en_1'] == '')

In [9]:
train_en_1_df = train_en_1_df[~mask]
train_en_1_df = train_en_1_df.reset_index(drop=True)
train_en_1_df

                                            train_en_1
0    Henry Ford Academy is the first charter school...
1    The Henry Ford Learning Institute is using the...
2                                       See also[edit]
3    List of public school academy districts in Mic...
4                                     References[edit]
..                                                 ...
111                                              Wayne
112                                             Monroe
113                                           Michigan
114    Coordinates: 42°18′11.9″N 83°13′52″W / 42.30...


In [10]:
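# Note: to_string() renders the printed form of the Series, so index labels become part of the text
en_1_string = train_en_1_df['train_en_1'].to_string()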
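
`Series.to_string()` returns the printed representation of the Series, which is why the token list that follows contains index numbers such as `'0'` and `'1'` and a cut-off `'Mic'`. If only the row contents are wanted, joining the values with spaces is one alternative. A small sketch with made-up rows:

```python
import pandas as pd

# Made-up rows standing in for the lines of a text file.
lines = pd.Series(["Henry Ford Academy", "See also", "References"])

# to_string() keeps the index labels in the text.
printed_tokens = lines.to_string().split()

# Joining the values keeps only the row contents.
joined_tokens = " ".join(lines).split()

print(printed_tokens)
print(joined_tokens)
```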

In [11]:
replace_list = ['(', ')', '[', ']', '\\', '/', '...', '\ufeff', '-', ',', '.', ';', '|', '?', '!', '°', '′', '″', ':', "'", '"', '﻿',
                '^', '@', '#', '`', '$', '%', '&', '*', '<', '>']

for entry in replace_list:
    en_1_string = en_1_string.replace(entry, ' ')

In [12]:
en_1_string_list = en_1_string.split()

In [13]:
en_1_string_df = pd.DataFrame(en_1_string_list, columns=['en_1_string'])
en_1_string_df

     en_1_string
0              0
1          Henry
2           Ford
3        Academy
4             is
..           ...
395      related
396      article
397           is
398            a


In [14]:
en_1_string_df['en_1_string'].unique()

array(['0', 'Henry', 'Ford', 'Academy', 'is', 'the', 'first', 'charter',
       'school', '1', 'The', 'Learning', 'Institute', 'using', '2', 'See',
       'also', 'edit', '3', 'List', 'of', 'public', 'academy',
       'districts', 'in', 'Mic', '4', 'References', '5', 'External',
       'links', '6', 'website', '7', 'High', 'schools', 'Wayne', 'County',
       'Michigan', '8', 'Public', 'high', '9', 'Central', '10', 'Denby',
       '11', '12', 'Mumford', '13', 'Osborn', '14', 'Southeastern', '15',
       'Americas', '16', 'Cass', 'Technical', '17', 'Crosman', '18',
       'Detroit', 'International', 'for', 'Young', 'Women', '19',
       'School', 'Arts', '20', 'Millennium', '21', 'Trombly',
       'Alternative', '22', 'White', 'Center', '23', 'Closed', 'or',
       'merged', '24', 'Cooley', '25', 'City', '26', 'Kettering', '27',
       'Murray', 'Wright', '28', 'Northern', '29', 'Southwestern', '30',
       'Allen', 'Park', '31', 'Carlson', 'Gibraltar', '32',
       'Clarenceville', '33

In [15]:
number_mask = en_1_string_df['en_1_string'].str.isnumeric()

In [16]:
en_1_string_df = en_1_string_df[~number_mask]

In [17]:
en_1_string_df.duplicated().sum()

86

In [18]:
en_1_string_df = en_1_string_df.drop_duplicates().reset_index(drop=True)

In [19]:
en_1_string_df['en_1_string'].unique()

array(['Henry', 'Ford', 'Academy', 'is', 'the', 'first', 'charter',
       'school', 'The', 'Learning', 'Institute', 'using', 'See', 'also',
       'edit', 'List', 'of', 'public', 'academy', 'districts', 'in',
       'Mic', 'References', 'External', 'links', 'website', 'High',
       'schools', 'Wayne', 'County', 'Michigan', 'Public', 'high',
       'Central', 'Denby', 'Mumford', 'Osborn', 'Southeastern',
       'Americas', 'Cass', 'Technical', 'Crosman', 'Detroit',
       'International', 'for', 'Young', 'Women', 'School', 'Arts',
       'Millennium', 'Trombly', 'Alternative', 'White', 'Center',
       'Closed', 'or', 'merged', 'Cooley', 'City', 'Kettering', 'Murray',
       'Wright', 'Northern', 'Southwestern', 'Allen', 'Park', 'Carlson',
       'Gibraltar', 'Clarenceville', 'Dearborn', 'Edsel', 'Fordson',
       'Garden', 'Grosse', 'Ile', 'Pointe', 'South', 'Harper', 'Woods',
       'Lincoln', 'Northville', 'Redford', 'Union', 'River', 'Rouge',
       'Theodore', 'Roosevelt', 'Wyand

In [82]:
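# Preprocess one text file into a DataFrame of unique lowercase words
def preprocessing_file(path, colname):
    
    # Load the file
    # Note: `row` already holds one line and f.readline() consumes the next one,
    # so only every second line of the file is kept.
    text_list = []
    with open(path, 'r', encoding='utf-8') as f:
        for row in f:
            text_list.append(f.readline().split('\n')[0])
    
    # Build a DataFrame for word-level processing (one row per line of text)
    text_df = pd.DataFrame(text_list, columns=[colname])
    
    # Drop leftovers of line breaks (empty rows and stray 't' rows)
    mask = (text_df[colname] == 't') | (text_df[colname] == '')
    text_df = text_df[~mask]
    text_df = text_df.reset_index(drop=True)
    
    # Convert to a single lowercase string so special symbols can be stripped
    # Note: to_string() renders the printed form, so index labels are included in the text.
    string = text_df[colname].to_string()
    string = string.lower()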
    replace_list = ['(', ')', '[', ']', '\\', '/', '...', '\ufeff', '-', ',', '.', ';', '|', '?', '!', '°', '′', '″', ':', "'", '"', '﻿',
                '^', '@', '#', '`', '$', '%', '&', '*', '<', '>', '+', '–', '·', '██', 'ˈ', '×', '=', 'ɡ', 'ʔ', '’', '«', '↑', '»',
                '‎', 'スター・ウォーズ', 'ローグ', 'スコードロン', '中国']
    for entry in replace_list:
        string = string.replace(entry, ' ')
    # string = ' '.join(char for char in string if not char.isnumeric())
    string_list = string.split()
    
    # Drop purely numeric tokens
    string_df = pd.DataFrame(string_list, columns=[colname])
    number_mask = string_df[colname].str.isnumeric()
    string_df = string_df[~number_mask].reset_index(drop=True)
    
    # Drop single-character tokens
    string_df = string_df[string_df[colname].str.len() != 1]
    
    # Drop duplicates
    string_df = string_df.drop_duplicates().reset_index(drop=True)
    
    return string_df

In [83]:
train_en_1_df = preprocessing_file(path='../data/Language/train/en-1.txt', colname='train_en_1')
train_en_2_df = preprocessing_file(path='../data/Language/train/en-2.txt', colname='train_en_2')
train_en_3_df = preprocessing_file(path='../data/Language/train/en-3.txt', colname='train_en_3')
train_en_4_df = preprocessing_file(path='../data/Language/train/en-4.txt', colname='train_en_4')
train_en_5_df = preprocessing_file(path='../data/Language/train/en-5.txt', colname='train_en_5')

train_en_1_df2 = preprocessing_file(path='../data/Language/train/en-1.txt', colname='train_en')
train_en_2_df2 = preprocessing_file(path='../data/Language/train/en-2.txt', colname='train_en')
train_en_3_df2 = preprocessing_file(path='../data/Language/train/en-3.txt', colname='train_en')
train_en_4_df2 = preprocessing_file(path='../data/Language/train/en-4.txt', colname='train_en')
train_en_5_df2 = preprocessing_file(path='../data/Language/train/en-5.txt', colname='train_en')

train_en_df = pd.concat([train_en_1_df, train_en_2_df, train_en_3_df, train_en_4_df, train_en_5_df], axis=1)
train_en_df2 = pd.concat([train_en_1_df2, train_en_2_df2, train_en_3_df2, train_en_4_df2, train_en_5_df2], axis=0)

In [84]:
train_fr_1_df = preprocessing_file(path='../data/Language/train/fr-6.txt', colname='train_fr_1')
train_fr_2_df = preprocessing_file(path='../data/Language/train/fr-7.txt', colname='train_fr_2')
train_fr_3_df = preprocessing_file(path='../data/Language/train/fr-8.txt', colname='train_fr_3')
train_fr_4_df = preprocessing_file(path='../data/Language/train/fr-9.txt', colname='train_fr_4')
train_fr_5_df = preprocessing_file(path='../data/Language/train/fr-10.txt', colname='train_fr_5')

train_fr_1_df2 = preprocessing_file(path='../data/Language/train/fr-6.txt', colname='train_fr')
train_fr_2_df2 = preprocessing_file(path='../data/Language/train/fr-7.txt', colname='train_fr')
train_fr_3_df2 = preprocessing_file(path='../data/Language/train/fr-8.txt', colname='train_fr')
train_fr_4_df2 = preprocessing_file(path='../data/Language/train/fr-9.txt', colname='train_fr')
train_fr_5_df2 = preprocessing_file(path='../data/Language/train/fr-10.txt', colname='train_fr')

train_fr_df = pd.concat([train_fr_1_df, train_fr_2_df, train_fr_3_df, train_fr_4_df, train_fr_5_df], axis=1)
train_fr_df2 = pd.concat([train_fr_1_df2, train_fr_2_df2, train_fr_3_df2, train_fr_4_df2, train_fr_5_df2], axis=0)

In [85]:
train_id_1_df = preprocessing_file(path='../data/Language/train/id-11.txt', colname='train_id_1')
train_id_2_df = preprocessing_file(path='../data/Language/train/id-12.txt', colname='train_id_2')
train_id_3_df = preprocessing_file(path='../data/Language/train/id-13.txt', colname='train_id_3')
train_id_4_df = preprocessing_file(path='../data/Language/train/id-14.txt', colname='train_id_4')
train_id_5_df = preprocessing_file(path='../data/Language/train/id-15.txt', colname='train_id_5')

train_id_1_df2 = preprocessing_file(path='../data/Language/train/id-11.txt', colname='train_id')
train_id_2_df2 = preprocessing_file(path='../data/Language/train/id-12.txt', colname='train_id')
train_id_3_df2 = preprocessing_file(path='../data/Language/train/id-13.txt', colname='train_id')
train_id_4_df2 = preprocessing_file(path='../data/Language/train/id-14.txt', colname='train_id')
train_id_5_df2 = preprocessing_file(path='../data/Language/train/id-15.txt', colname='train_id')

train_id_df = pd.concat([train_id_1_df, train_id_2_df, train_id_3_df, train_id_4_df, train_id_5_df], axis=1)
train_id_df2 = pd.concat([train_id_1_df2, train_id_2_df2, train_id_3_df2, train_id_4_df2, train_id_5_df2], axis=0)

In [86]:
train_tl_1_df = preprocessing_file(path='../data/Language/train/tl-16.txt', colname='train_tl_1')
train_tl_2_df = preprocessing_file(path='../data/Language/train/tl-17.txt', colname='train_tl_2')
train_tl_3_df = preprocessing_file(path='../data/Language/train/tl-18.txt', colname='train_tl_3')
train_tl_4_df = preprocessing_file(path='../data/Language/train/tl-19.txt', colname='train_tl_4')
train_tl_5_df = preprocessing_file(path='../data/Language/train/tl-20.txt', colname='train_tl_5')

train_tl_1_df2 = preprocessing_file(path='../data/Language/train/tl-16.txt', colname='train_tl')
train_tl_2_df2 = preprocessing_file(path='../data/Language/train/tl-17.txt', colname='train_tl')
train_tl_3_df2 = preprocessing_file(path='../data/Language/train/tl-18.txt', colname='train_tl')
train_tl_4_df2 = preprocessing_file(path='../data/Language/train/tl-19.txt', colname='train_tl')
train_tl_5_df2 = preprocessing_file(path='../data/Language/train/tl-20.txt', colname='train_tl')

train_tl_df = pd.concat([train_tl_1_df, train_tl_2_df, train_tl_3_df, train_tl_4_df, train_tl_5_df], axis=1)
train_tl_df2 = pd.concat([train_tl_1_df2, train_tl_2_df2, train_tl_3_df2, train_tl_4_df2, train_tl_5_df2], axis=0)

In [87]:
test_en_1_df = preprocessing_file(path='../data/Language/test/en-1.txt', colname='test_en_1')
test_en_2_df = preprocessing_file(path='../data/Language/test/en-2.txt', colname='test_en_2')

test_en_1_df2 = preprocessing_file(path='../data/Language/test/en-1.txt', colname='test_en')
test_en_2_df2 = preprocessing_file(path='../data/Language/test/en-2.txt', colname='test_en')

test_en_df = pd.concat([test_en_1_df, test_en_2_df], axis=1)
test_en_df2 = pd.concat([test_en_1_df2, test_en_2_df2], axis=0)

In [88]:
test_fr_1_df = preprocessing_file(path='../data/Language/test/fr-3.txt', colname='test_fr_1')
test_fr_2_df = preprocessing_file(path='../data/Language/test/fr-4.txt', colname='test_fr_2')

test_fr_1_df2 = preprocessing_file(path='../data/Language/test/fr-3.txt', colname='test_fr')
test_fr_2_df2 = preprocessing_file(path='../data/Language/test/fr-4.txt', colname='test_fr')

test_fr_df = pd.concat([test_fr_1_df, test_fr_2_df], axis=1)
test_fr_df2 = pd.concat([test_fr_1_df2, test_fr_2_df2], axis=0)

In [89]:
test_id_1_df = preprocessing_file(path='../data/Language/test/id-5.txt', colname='test_id_1')
test_id_2_df = preprocessing_file(path='../data/Language/test/id-6.txt', colname='test_id_2')

test_id_1_df2 = preprocessing_file(path='../data/Language/test/id-5.txt', colname='test_id')
test_id_2_df2 = preprocessing_file(path='../data/Language/test/id-6.txt', colname='test_id')

test_id_df = pd.concat([test_id_1_df, test_id_2_df], axis=1)
test_id_df2 = pd.concat([test_id_1_df2, test_id_2_df2], axis=0)

In [90]:
test_tl_1_df = preprocessing_file(path='../data/Language/test/tl-7.txt', colname='test_tl_1')
test_tl_2_df = preprocessing_file(path='../data/Language/test/tl-8.txt', colname='test_tl_2')

test_tl_1_df2 = preprocessing_file(path='../data/Language/test/tl-7.txt', colname='test_tl')
test_tl_2_df2 = preprocessing_file(path='../data/Language/test/tl-8.txt', colname='test_tl')

test_tl_df = pd.concat([test_tl_1_df, test_tl_2_df], axis=1)
test_tl_df2 = pd.concat([test_tl_1_df2, test_tl_2_df2], axis=0)

In [91]:
# Save the DataFrames
train_en_df.to_csv('../data/Language/train_en_df.csv', encoding='utf-8', index=False)
train_fr_df.to_csv('../data/Language/train_fr_df.csv', encoding='utf-8', index=False)
train_id_df.to_csv('../data/Language/train_id_df.csv', encoding='utf-8', index=False)
train_tl_df.to_csv('../data/Language/train_tl_df.csv', encoding='utf-8', index=False)

test_en_df.to_csv('../data/Language/test_en_df.csv', encoding='utf-8', index=False)
test_fr_df.to_csv('../data/Language/test_fr_df.csv', encoding='utf-8', index=False)
test_id_df.to_csv('../data/Language/test_id_df.csv', encoding='utf-8', index=False)
test_tl_df.to_csv('../data/Language/test_tl_df.csv', encoding='utf-8', index=False)

In [92]:
# Save the DataFrames
train_en_df2.to_csv('../data/Language/train_en_df2.csv', encoding='utf-8', index=False)
train_fr_df2.to_csv('../data/Language/train_fr_df2.csv', encoding='utf-8', index=False)
train_id_df2.to_csv('../data/Language/train_id_df2.csv', encoding='utf-8', index=False)
train_tl_df2.to_csv('../data/Language/train_tl_df2.csv', encoding='utf-8', index=False)

test_en_df2.to_csv('../data/Language/test_en_df2.csv', encoding='utf-8', index=False)
test_fr_df2.to_csv('../data/Language/test_fr_df2.csv', encoding='utf-8', index=False)
test_id_df2.to_csv('../data/Language/test_id_df2.csv', encoding='utf-8', index=False)
test_tl_df2.to_csv('../data/Language/test_tl_df2.csv', encoding='utf-8', index=False)

In [93]:
train_en_1_df2.tail()

        train_en
170      oakland
171  coordinates
172      related
173      article
174         stub
