# Data Analysis

* [Typo](#typo)
* [Irregular character](#irregular-character)

> * [rapidsai/cudf: cuDF - GPU DataFrame Library](https://github.com/rapidsai/cudf)
> * [Numba: A High Performance Python Compiler](https://numba.pydata.org/)

In [None]:
# Run only once, on first use
# !python3 -m pip install pypinyin # dependency
# !python3 -m pip install pycorrector

In [None]:
# Suppress warnings in the Jupyter notebook
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

In [None]:
# Helper Function

data_file = {
    "Ant": "raw_data/competition_train.csv",
    "CCSK": "raw_data/task3_train.txt"
}

pd_common_param = {
    "delimiter": "\t",
    "names": ["sentence1", "sentence2", "label"]
}

# TODO: remember to remove .head()
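def load_sentences(dataset):
    """Return the unique sentences (from both columns) of the given dataset."""
    if dataset == "Ant":
        # index_col=0: treat the first column of the Ant file as the row index
        data = pd.read_csv(data_file[dataset], index_col=0, **pd_common_param)
    elif dataset == "CCSK":
        data = pd.read_csv(data_file[dataset], **pd_common_param)
    else:
        # Fail early instead of hitting an unbound `data` below
        raise ValueError(f"Unknown dataset: {dataset!r}")

    # Deduplicate across both sentence columns (set() does not preserve order)
    sentences = list(set(data['sentence1'].to_list() + data['sentence2'].to_list()))

    return sentences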
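
As a minimal sketch of the deduplication step in `load_sentences`, the example below reads a small, made-up tab-separated sample from memory instead of the real files. The sample rows and the `sample_tsv` name are illustrative assumptions, not part of either dataset.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same layout as the CCSK file:
# sentence1 <TAB> sentence2 <TAB> label
sample_tsv = "s_a\ts_b\t1\ns_b\ts_c\t0\ns_a\ts_c\t1\n"

data = pd.read_csv(
    io.StringIO(sample_tsv),
    delimiter="\t",
    names=["sentence1", "sentence2", "label"],
)

# Same set-based deduplication as load_sentences;
# sorted() is added here only to make the order reproducible
sentences = sorted(set(data["sentence1"].to_list() + data["sentence2"].to_list()))
print(len(data), "pairs ->", len(sentences), "unique sentences")
```

Six sentence slots collapse to three unique sentences here, which is why the later counts are per unique sentence rather than per pair.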
    

## Typo & Irregular character using `pycorrector`

> Resources
> * [Natural Language Processing: Typo Detection (Python-based) with kenlm and pycorrector](https://cloud.tencent.com/developer/article/1387643)
> * [shibing624/pycorrector: pycorrector is a toolkit for text error correction. It was developed to facilitate the designing, comparing, and sharing of deep text error correction models.](https://github.com/shibing624/pycorrector)

In [None]:
import pycorrector
# Hide INFO and DEBUG log messages
pycorrector.set_log_level('WARN')
import multiprocessing as mp
import numpy as np

In [None]:
# Load sentences
ant_sentences = load_sentences("Ant")
ccsk_sentences = load_sentences("CCSK")

class TypoCounter:
    def __init__(self, sentences):
        self.ss = sentences
        self.n = len(sentences)
        self.corrector = pycorrector.corrector
        self.results = []

    def _count_i(self, i):
        # Number of entries in the second element returned by correct()
        # for the sentence at index i
        return len(self.corrector.correct(self.ss[i])[1])

    def _collect_result(self, result):
        # Optional callback for apply_async (currently unused)
        self.results.append(result)

    def count_incorrect_chars(self):
        # Submit one task per sentence; results are read back in input order
        pool = mp.Pool(mp.cpu_count())
        result_objs = [pool.apply_async(self._count_i, args=(i, )) for i in range(self.n)]
        pool.close()
        counts = [r.get() for r in result_objs]
        # Wait for the worker processes to exit before returning
        pool.join()
        return counts
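
The fan-out used by `count_incorrect_chars` can be sketched without `pycorrector`. This minimal example makes two assumptions: `count_errors` is a hypothetical stand-in for the real correction call, and `multiprocessing.pool.ThreadPool` replaces `mp.Pool` so the sketch runs anywhere without pickling concerns. `ThreadPool` exposes the same `map` and `apply_async` methods, but threads will not speed up CPU-bound work the way processes can.

```python
from multiprocessing.pool import ThreadPool


def count_errors(sentence):
    # Hypothetical stand-in for len(corrector.correct(sentence)[1]):
    # every "x" in the sentence is treated as one detected error
    return sentence.count("x")


def count_all(sentences, workers=4):
    # map() returns results in input order, so counts[i] belongs to sentences[i]
    with ThreadPool(workers) as pool:
        return pool.map(count_errors, sentences)


counts = count_all(["abc", "axb", "xxz"])
print(counts, sum(counts))  # prints: [0, 1, 2] 3
```

Passing a plain module-level function instead of a bound method keeps the worker payload small, which matters once the pool is switched back to processes.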


In [None]:
ant_counter = TypoCounter(ant_sentences)

print("Incorrect characters count in Ant:", np.sum(ant_counter.count_incorrect_chars()))

In [None]:
ccsk_counter = TypoCounter(ccsk_sentences)

print("Incorrect characters count in CCSK:", np.sum(ccsk_counter.count_incorrect_chars()))

## Appendix

### Multiprocessing / Parallel Processing

* `multiprocessing`
    * [**Parallel Processing in Python - A Practical Guide with Examples | ML+**](https://www.machinelearningplus.com/python/parallel-processing-python/)
    * [parallel processing - How do I parallelize a simple Python loop? - Stack Overflow](https://stackoverflow.com/questions/9786102/how-do-i-parallelize-a-simple-python-loop)
* `Numba Jit`
    * [Automatic parallelization with @jit (Numba documentation)](https://numba.pydata.org/numba-doc/latest/user/parallel.html)
    * [Python · Basic usage of numba - Zhihu](https://zhuanlan.zhihu.com/p/27152060)

### Logging Level

* [Understanding logging levels](https://www.ibm.com/support/knowledgecenter/en/SSEP7J_10.2.2/com.ibm.swg.ba.cognos.ug_rtm_wb.10.2.2.doc/c_n30e74.html)
* [python - Hide all warnings in ipython - Stack Overflow](https://stackoverflow.com/questions/9031783/hide-all-warnings-in-ipython)
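
As a hedged illustration of the two ideas linked above, the sketch below scopes warning suppression to a single block with the standard `warnings` module and sets a threshold on a standard `logging` logger. The `noisy` function and the `"demo"` logger name are hypothetical; `pycorrector` is not involved.

```python
import logging
import warnings


def noisy():
    # Hypothetical function that emits a warning before returning
    warnings.warn("deprecated call", DeprecationWarning)
    return 42


# Suppress warnings only inside this block instead of for the whole notebook;
# record=True collects any warnings that still get through the filter
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore")
    value = noisy()

# A logger set to WARNING drops INFO and DEBUG records
logger = logging.getLogger("demo")
logger.setLevel(logging.WARNING)
logger.info("not shown")
logger.warning("shown")
```

Scoping the filter this way keeps warnings visible in the rest of the notebook, which a global `warnings.filterwarnings('ignore')` does not.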
