## Data processing - Machine Learning


#### Word Frequency Counter
Given a text string, return the top-k most frequent words.
- Ignore case (treat "Apple" and "apple" the same).
- Ignore punctuation (",", ".", "!", etc. should be stripped out).
- If two words have the same frequency, return them in lexicographical order.

Example:
- text = "Hello world! Hello, Machine Learning world."
- k = 2

Output: ["hello", "world"]
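The stated example can be checked with a minimal sketch of the three rules. `top_k_words` is a hypothetical name, and treating every character that is neither a word character nor whitespace as punctuation is an assumption, not something the task statement spells out.

```python
import re
from collections import Counter
from typing import List


def top_k_words(text: str, k: int) -> List[str]:
    # Assumption: "punctuation" is any character that is neither
    # a word character nor whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", text.lower())
    counts = Counter(cleaned.split())
    # Descending count first, then alphabetical order to break ties.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return ranked[:k]


print(top_k_words("Hello world! Hello, Machine Learning world.", 2))
# ['hello', 'world']
```

Negating the count in the sort key lets one ascending sort handle both criteria at once.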

In [1]:
from typing import List
import re

In [None]:
from collections import Counter
from typing import Tuple

def process_text(text: str, k: int, n_gram: int) -> Tuple[List[str], List[str]]:
    """Return the top-k words and the top-k n-grams (of n_gram words each)."""
    text = text.lower()
    # Replace punctuation with spaces. Apostrophes are stripped too, so "I'm" becomes "i m".
    text = re.sub(r"[#!.,&%$@~+\-^():;?',\"\s]", " ", text)
    words = text.split()
    # Slide a window of n_gram words; the bound is len(words) + 1 so the last window is included.
    n_grams = [' '.join(words[i - n_gram:i]) for i in range(n_gram, len(words) + 1)]
    word_counter = Counter(words)
    n_grams_counter = Counter(n_grams)
    # Sort by descending count, then lexicographically to break ties.
    top_k_words = sorted(word_counter.items(), key=lambda x: (-x[1], x[0]))[:k]
    top_k_n_grams = sorted(n_grams_counter.items(), key=lambda x: (-x[1], x[0]))[:k]
    return [w for w, _ in top_k_words], [g for g, _ in top_k_n_grams]


text = "Hello World! Hello, Machine Learning World - I'm coming for you. I am job hunting"
print(process_text(text, 3, 2))

text = "to be or not to be, that is the question. To be or not?"
print(process_text(text, 3, 2))


(['hello', 'i', 'world'], ['am job', 'coming for', 'for you'])
(['be', 'to', 'not'], ['to be', 'be or', 'or not'])
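
The n-gram step can be looked at in isolation. The sketch below builds the sliding windows with `zip` over shifted slices instead of index arithmetic; `make_n_grams` is a hypothetical helper, not part of the notebook, and it assumes the text has already been lowercased and tokenized.

```python
from collections import Counter
from typing import List


def make_n_grams(words: List[str], n: int) -> List[str]:
    """Return all contiguous n-word windows, joined by single spaces."""
    # zip stops at the shortest slice, so a list of len(words) tokens
    # yields len(words) - n + 1 windows (none if n > len(words)).
    return [" ".join(window) for window in zip(*(words[i:] for i in range(n)))]


words = "to be or not to be".split()
bigrams = make_n_grams(words, 2)
print(bigrams)
print(Counter(bigrams).most_common(1))
```

A quick sanity check for any n-gram routine is the window count: six tokens should give five bigrams, including the final pair.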


In [2]:
%%bash

$(which python3) -m pip install pandas

Defaulting to user installation because normal site-packages is not writeable

[notice] A new release of pip is available: 25.2 -> 25.3
[notice] To update, run: python3 -m pip install --upgrade pip


In [6]:
from ast import literal_eval
import pandas as pd

In [7]:
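def read_data(filename):
    # The file is tab-separated, so pass sep='\t' instead of the default comma.
    data = pd.read_csv(filename, sep='\t')
    # Each tags cell is read as a string holding a Python list literal;
    # literal_eval parses it into a real list without executing arbitrary code.
    data['tags'] = data['tags'].apply(literal_eval)
    return data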

train = read_data('train.tsv')
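
To see what the `literal_eval` step changes, here is a small sketch that runs the same two operations on an in-memory sample. The sample rows and the exact quoting of the tags are assumptions made for illustration; the real `train.tsv` may differ in detail.

```python
import io
from ast import literal_eval

import pandas as pd

# Hypothetical two-row sample standing in for train.tsv.
sample_tsv = (
    "title\ttags\n"
    "How to draw a stacked dotplot in R?\t['r']\n"
    "How to connect via HTTPS using Jsoup?\t['java', 'android']\n"
)

raw = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(type(raw.loc[1, "tags"]))     # still a plain string at this point

parsed = raw.copy()
parsed["tags"] = parsed["tags"].apply(literal_eval)
print(type(parsed.loc[1, "tags"]))  # now a Python list
print(parsed.loc[1, "tags"])
```

Without the parsing step, operations such as counting tags per question would work on characters of a string instead of list elements.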

In [10]:
train

Unnamed: 0,title,tags
0,How to draw a stacked dotplot in R?,[r]
1,mysql select all records where a datetime fiel...,"[php, mysql]"
2,How to terminate windows phone 8.1 app,[c#]
3,get current time in a specific country via jquery,"[javascript, jquery]"
4,Configuring Tomcat to Use SSL,[java]
...,...,...
99995,"Obj-c, incorrect checksum for freed object - o...","[iphone, objective-c, ios, cocoa-touch]"
99996,How to connect via HTTPS using Jsoup?,"[java, android]"
99997,Python Pandas Series of Datetimes to Seconds S...,"[python, datetime, pandas]"
99998,jqGrid issue grouping - Duplicate rows get app...,"[javascript, jquery]"
