# Preprocessing

## Clean your text data with clean-text

Content from the web and social media is rarely clean.

`clean-text` handles the preprocessing for you.

You can specify whether and how each cleaning step is applied.

In [None]:
!pip install "clean-text[gpl]"

In [None]:
from cleantext import clean

text = '''
       If you want to talk, send me an email: testmail@outlook.com, 
       call me +71112392 or visit my website: https://testurl.com. 
       Calling me is not free, It'\\u2018s\\u2019 costing 0.40$ per 
       minute.
       '''

clean(text,
    fix_unicode=True,              # fix various unicode errors
    to_ascii=True,                 # transliterate to closest ASCII representation
    lower=True,                    # lowercase text
    no_urls=True,                  # replace all URLs with a special token
    no_emails=True,                # replace all email addresses with a special token
    no_phone_numbers=True,         # replace all phone numbers with a special token
    no_numbers=True,               # replace all numbers with a special token
    no_digits=True,                # replace all digits with a special token
    no_currency_symbols=True,      # replace all currency symbols with a special token
    no_punct=True,                 # remove punctuation
    lang="en"                      # set to 'de' for German special handling
)
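
To make "replace with a special token" concrete, here is a minimal standard-library sketch of the idea. It is not how `clean-text` is implemented: the helper name `replace_with_tokens`, the two regular expressions, and the token strings are assumptions chosen for illustration, and the patterns are far cruder than what a real cleaning library needs.

```python
import re

# Deliberately simple patterns, for illustration only (assumptions, not clean-text's rules).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"https?://\S+")


def replace_with_tokens(text: str) -> str:
    """Hypothetical helper: lowercase the text and mask emails and URLs."""
    text = text.lower()
    text = EMAIL_RE.sub("<EMAIL>", text)  # mask email addresses first
    text = URL_RE.sub("<URL>", text)      # then mask URLs
    return " ".join(text.split())         # collapse runs of whitespace


sample = "Mail me: testmail@outlook.com or visit https://testurl.com today"
print(replace_with_tokens(sample))
# mail me: <EMAIL> or visit <URL> today
```

Masking with tokens instead of deleting keeps the information that an email or URL was present, which can itself be a useful signal for downstream models.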

## Detect and fix your data quality issues

Do you want to detect data quality issues?

Try `pandas_dq`.

`pandas_dq` is a relatively new library that detects data quality issues and fixes them automatically, for example:

- Zero-Variance Columns
- Rare Categories
- Highly Correlated Features
- Skewed Distributions
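
To get a feel for what these four issues look like in practice, the following sketch checks for each of them by hand with plain pandas. It does not reproduce the checks `pandas_dq` performs: the toy DataFrame, the column names, and the thresholds (1% for rare categories, 0.9 for correlation, 1 for skewness) are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, built so that every issue appears exactly once.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
toy = pd.DataFrame({
    "constant": np.ones(n),                                # zero variance
    "x": x,
    "x_copy": 2.0 * x + rng.normal(scale=0.01, size=n),    # almost a copy of x
    "skewed": rng.exponential(size=n) ** 2,                # long right tail
    "city": ["Berlin"] * 150 + ["Paris"] * 49 + ["Oslo"],  # "Oslo" occurs once
})

numeric = toy.select_dtypes(include="number")

# Zero-variance columns: only one distinct value.
zero_var = [c for c in numeric.columns if numeric[c].nunique() <= 1]

# Rare categories: levels below an assumed 1% relative frequency.
freq = toy["city"].value_counts(normalize=True)
rare = freq[freq < 0.01].index.tolist()

# Highly correlated features: absolute Pearson correlation above an assumed 0.9.
corr = numeric.drop(columns=zero_var).corr().abs()
high_corr = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > 0.9
]

# Skewed distributions: absolute sample skewness above an assumed cutoff of 1.
skew = numeric.drop(columns=zero_var).skew()
skewed = skew[skew.abs() > 1].index.tolist()

print(zero_var, rare, high_corr, skewed)
```

Writing the checks out once makes it easier to judge whether an automated report flags something meaningful or just an artifact of a threshold.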

In [None]:
!pip install pandas_dq -q

In [None]:
import pandas as pd
import numpy as np
from pandas_dq import dq_report, Fix_DQ
from sklearn.datasets import load_iris

In [None]:
data = load_iris()

In [None]:
# Combine the feature matrix and the target vector into a single DataFrame
data = pd.DataFrame(data=np.c_[data['data'], data['target']],
                    columns=data['feature_names'] + ['target'])
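
The `np.c_` helper used above stacks arrays column-wise, so the target ends up as the last column of the frame. A small sketch on made-up values shows the effect, including one side effect worth knowing: because a NumPy array has a single dtype, integer labels are upcast to floats. The arrays and column names here are hypothetical stand-ins for `data['data']` and `data['target']`.

```python
import numpy as np
import pandas as pd

features = np.array([[5.1, 3.5], [4.9, 3.0], [6.2, 2.9]])  # toy feature matrix
target = np.array([0, 0, 1])                                # toy integer labels

stacked = np.c_[features, target]  # shape (3, 3), target appended as last column
frame = pd.DataFrame(stacked, columns=["f1", "f2", "target"])

print(frame.dtypes)  # all columns are float64, including "target"
```

If integer labels are needed later, `frame["target"].astype(int)` converts them back.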

In [None]:
dq_report(data, verbose=1)

In [None]:
fdq = Fix_DQ()
data_transformed = fdq.fit_transform(data)
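
The `fit_transform` call follows the scikit-learn convention of learning from data in `fit` and applying what was learned in `transform`. The sketch below illustrates why that split matters, using a toy transformer that only drops constant columns. The class `DropConstantColumns` is hypothetical and not part of `pandas_dq`; it merely mimics the same interface.

```python
import pandas as pd


class DropConstantColumns:
    """Hypothetical toy transformer, NOT part of pandas_dq."""

    def fit(self, df):
        # Learn which columns hold a single value in the fitting data.
        self.to_drop_ = [c for c in df.columns if df[c].nunique() <= 1]
        return self

    def transform(self, df):
        # Apply the learned decision without re-inspecting the new data.
        return df.drop(columns=self.to_drop_)

    def fit_transform(self, df):
        return self.fit(df).transform(df)


train = pd.DataFrame({"a": [1, 2, 3], "b": [7, 7, 7]})
test = pd.DataFrame({"a": [4, 5], "b": [7, 8]})

fixer = DropConstantColumns()
train_fixed = fixer.fit_transform(train)  # learns that "b" is constant and drops it
test_fixed = fixer.transform(test)        # drops "b" as well, although it varies here
```

Fitting on the training data only and reusing the fitted object on new data keeps both sets in the same shape and avoids leaking information from the test set into the cleaning step.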