# Preprocessing Requirements
## RECLASS

**Purpose**: Verify the output datasets.

- `uber_reviews_cleaned.csv` - Full cleaned dataset
- `uber_reviews_sampled.csv` - TBC
- `uber_reviews_tagged.csv` - TBC (annotation in progress)

**Dataset**: Uber Customer Reviews from Google Play (Kaggle)

---

In [1]:
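import pandas as pd
import numpy as np
import re
from langdetect import DetectorFactory, detect, LangDetectException

# langdetect is non-deterministic by default; fixing the seed makes results repeatable.
DetectorFactory.seed = 0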



In [2]:
df = pd.read_csv('../data/uber_reviews_cleaned.csv')
df.head()

| | review | rating | word_count |
|---|---|---|---|
| 0 | suddenly, the driver can't have my location an... | 1 | 23 |
| 1 | very cordial. and helped with a quick turnarou... | 5 | 11 |
| 2 | i have enjoyed traveling by uber my drivers ha... | 5 | 23 |
| 3 | app is good but main problem is the drivers ca... | 1 | 23 |
| 4 | very bad experience no customer service | 1 | 6 |


In [3]:
(df['word_count'] > 100).sum()

np.int64(6740)
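
The filter above relies on the stored `word_count` column, but the notebook does not show how that column was computed upstream. A minimal sketch of a recount, assuming simple whitespace tokenization (the helper name `count_words` is hypothetical):

```python
def count_words(text):
    """Count whitespace-separated tokens; non-string input counts as 0."""
    if not isinstance(text, str):
        return 0
    return len(text.split())


# Row 4 of the preview has a stored word_count of 6.
print(count_words("very bad experience no customer service"))
```

On the full frame, the same recount could be written as `df['review'].str.split().str.len()` and compared against `df['word_count']`; any mismatches would suggest the upstream step used a different tokenization.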

In [4]:
print(f"Max length review:\n{df.loc[df['word_count'].idxmax(), 'review']}")

Max length review:
i've been using uber for a few years now and for the most part haven't had any problems the only problem i have ever had that was never resolved or explained to me was when i was refunded cancellation fee because the driver pulled into me and then literally looked in my face and drove away and canceled the ride.buy still charges me.a cancellation fee the problem wasn't being charged the fee because i was reimbursed a few minutes later the problem was that they gave me a $5 uber credit and i don't have a debit card so i use uber gift cards so i had a balance on the gift card of roughly $4.85 and is an uber credit of $5 but for some reason you can't combine them to use on a single gel ride and there is a minimum of i think $6 or $7 for a ride so i was never able to use the money i was reimbursed and the remaing balance on my gift card was not enough for a full ride and that was 4 months ago and still have not been able to get a response as to how i can use the funds re

In [5]:
def has_spam_repetition(text):
    """True if any character (letter, digit, punctuation, space) repeats 5+ times in a row."""
    return bool(re.search(r'(.)\1{4,}', str(text)))

df['review'].apply(has_spam_repetition).sum()

np.int64(2839)
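
Because `.` matches any character except a newline, the pattern flags runs of punctuation and digits as well as stretched letters, so part of the count above may not be spam at all. The sketch below compares the notebook's pattern with a stricter letters-only variant; `LETTER_RUN` and the example strings are hypothetical and are not taken from the dataset.

```python
import re

ANY_CHAR_RUN = re.compile(r'(.)\1{4,}')                  # pattern used in the notebook
LETTER_RUN = re.compile(r'([a-z])\1{4,}', re.IGNORECASE)  # hypothetical stricter variant

# Made-up strings chosen to show where the two patterns disagree.
examples = ["sooooo good", "wait.....", "charged me 100000", "good ride"]

for text in examples:
    any_hit = bool(ANY_CHAR_RUN.search(text))
    letter_hit = bool(LETTER_RUN.search(text))
    print(f"{text!r}: any-char={any_hit}, letters-only={letter_hit}")
```

Only the first string is flagged by both patterns; the ellipsis and the run of zeros are flagged by the any-character pattern alone. Whether those cases should count as spam depends on the annotation goals.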

In [6]:
sample_check = df[df['word_count'] >= 10].sample(1000)
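
`sample(1000)` draws a different set of rows on every run, so the language counts further down will shift slightly between executions. A minimal sketch of a repeatable draw, using a toy frame in place of the real data; the seed value 42 is an arbitrary assumption:

```python
import pandas as pd

# Hypothetical toy frame standing in for the cleaned reviews.
toy = pd.DataFrame({
    "review": [f"review {i}" for i in range(50)],
    "word_count": list(range(50)),
})

eligible = toy[toy["word_count"] >= 10]

# Passing the same random_state returns the same rows on every call.
sample_a = eligible.sample(n=5, random_state=42)
sample_b = eligible.sample(n=5, random_state=42)

print(sample_a.index.equals(sample_b.index))
```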

In [7]:
def detect_language(text):
    """Return the langdetect code for text, or 'unknown' if it is missing, under 10 characters, or undetectable."""
    try:
        if pd.isna(text) or len(str(text).strip()) < 10:
            return 'unknown'
        return detect(str(text))
    except LangDetectException:
        return 'unknown'

In [8]:
sample_check['lang'] = sample_check['review'].apply(detect_language)

In [9]:
print("\nLanguage distribution in 10+ word reviews:")
print(sample_check['lang'].value_counts())


Language distribution in 10+ word reviews:
lang
en    956
es      7
pt      7
ar      4
id      3
da      3
bn      3
af      3
it      2
tl      2
tr      2
fr      1
ro      1
et      1
sv      1
nl      1
hi      1
pl      1
ta      1
Name: count, dtype: int64
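
The counts above sum to 1000, so no review in this sample fell into the `'unknown'` bucket. As a rough sketch of how much the English share could vary from one random sample to another, the block below computes a normal-approximation (Wald) 95% interval. It covers sampling error only and assumes the langdetect labels are correct, which may not hold for short or informal reviews.

```python
import math

# Counts copied from the value_counts() output above.
counts = {"en": 956, "es": 7, "pt": 7, "ar": 4, "id": 3, "da": 3, "bn": 3, "af": 3,
          "it": 2, "tl": 2, "tr": 2, "fr": 1, "ro": 1, "et": 1, "sv": 1, "nl": 1,
          "hi": 1, "pl": 1, "ta": 1}

n = sum(counts.values())
p_en = counts["en"] / n

# Wald interval: p +/- 1.96 * sqrt(p * (1 - p) / n)
half_width = 1.96 * math.sqrt(p_en * (1 - p_en) / n)
lower, upper = p_en - half_width, p_en + half_width

print(f"n={n}, English share={p_en:.1%}, 95% CI=({lower:.1%}, {upper:.1%})")
```

Under these assumptions the English share of reviews with 10 or more words is 95.6%, with an interval of roughly 94.3% to 96.9%.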
