# Examples of functions:
- split_into_sentences
- language_filtering
- jaccard_sim_filtering
- perplexity_filtering
- outlier_detection

## Load package and dependencies

In [1]:
import textcl
import pandas as pd

## Prepare input data from the modified BBC dataset

Load the text data you want to process. The data must contain a `text` column (the default name). If your column has a different name, pass it to the **split_into_sentences** function through the `text_col` parameter. The source file in this example is structured as follows:

In [2]:
SOURCE_FILE_PATH = 'prepared_bbc_dataset.csv'

# getting text data from file
input_texts_df = pd.read_csv(SOURCE_FILE_PATH).reset_index()
input_texts_df

Unnamed: 0,index,topic_name,text
0,0,business,WorldCom bosses' $54m payout Ten former direc...
1,1,business,Profits slide at India's Dr Reddy Profits at ...
2,2,business,Liberian economy starts to grow The Liberian ...
3,3,business,"Uluslararası Para Fonu (IMF), Liberya ekonomis..."
4,4,entertainment,Singer Ian Brown 'in gig arrest' Former Stone...
5,5,entertainment,Blue beat U2 to top France honour Irish band ...
6,6,entertainment,Housewives lift Channel 4 ratings The debut o...
7,7,entertainment,Домохозяйки подняли рейтинги канала 4 Дебют ам...
8,8,entertainment,Housewives Channel 4 reytinglerini yükseltti A...
9,9,politics,Observers to monitor UK election Ministers wi...


## Split texts into sentences

If `sentence_col` is not specified as a parameter, the resulting sentences are saved in the `sentence` column.

In [3]:
split_input_texts_df = textcl.split_into_sentences(input_texts_df)
print("Num sentences before filtering: {}".format(len(split_input_texts_df)))

Num sentences before filtering: 319


In [4]:
split_input_texts_df.head()

Unnamed: 0,index,topic_name,text,sentence
0,0,business,WorldCom bosses' $54m payout Ten former direc...,WorldCom bosses' $54m payout Ten former direc...
1,0,business,WorldCom bosses' $54m payout Ten former direc...,"James Wareham, a lawyer representing one of t..."
2,0,business,WorldCom bosses' $54m payout Ten former direc...,The remaining $36m will be paid by the directo...
3,0,business,WorldCom bosses' $54m payout Ten former direc...,"But, a spokesman for the prosecutor, New York ..."
4,0,business,WorldCom bosses' $54m payout Ten former direc...,Corporate governance experts said that if the...
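To make the shape of this output concrete, here is a minimal sketch of the row expansion in plain Python. It is not `textcl`'s implementation: the helper `split_rows_into_sentences`, the naive punctuation rule, and the sample row are assumptions made for illustration, and the library's real splitter may behave differently.

```python
import re

def split_rows_into_sentences(rows, text_col="text", sentence_col="sentence"):
    """Expand each row (a dict) into one row per sentence, keeping the other columns."""
    out = []
    for row in rows:
        # Naive rule (assumption): a sentence ends at '.', '!' or '?' followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", row[text_col].strip()):
            if sentence:
                out.append({**row, sentence_col: sentence})
    return out

rows = [{"index": 0, "text": "Profits fell. Shares rose! Why?"}]
sentences = split_rows_into_sentences(rows)
```

Each output row repeats the original `index` and `text`, which is why the same text appears several times in the table above.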


## Filtering on language

In [5]:
split_input_texts_df = textcl.language_filtering(split_input_texts_df, threshold=0.99, language='en')
print("Num sentences after language filtering: {}".format(len(split_input_texts_df)))

Num sentences after language filtering: 279


In [6]:
split_input_texts_df.head()

Unnamed: 0,index,topic_name,text,sentence
0,0,business,WorldCom bosses' $54m payout Ten former direc...,WorldCom bosses' $54m payout Ten former direc...
1,0,business,WorldCom bosses' $54m payout Ten former direc...,"James Wareham, a lawyer representing one of t..."
2,0,business,WorldCom bosses' $54m payout Ten former direc...,The remaining $36m will be paid by the directo...
3,0,business,WorldCom bosses' $54m payout Ten former direc...,"But, a spokesman for the prosecutor, New York ..."
4,0,business,WorldCom bosses' $54m payout Ten former direc...,Corporate governance experts said that if the...


Join the sentences back into texts to review the results:

In [7]:
textcl.join_sentences_by_label(split_input_texts_df, label_col = 'index')

Unnamed: 0,index,sentence
0,0,WorldCom bosses' $54m payout Ten former direc...
1,1,Profits slide at India's Dr Reddy Profits at ...
2,2,Liberian economy starts to grow The Liberian ...
3,4,Singer Ian Brown 'in gig arrest' Former Stone...
4,5,Blue beat U2 to top France honour Irish band ...
5,6,Housewives lift Channel 4 ratings The debut o...
6,9,Observers to monitor UK election Ministers wi...
7,10,Lib Dems highlight problem debt People vulner...
8,11,Minister defends hunting ban law The law bann...
9,12,Legendary Dutch boss Michels dies Legendary D...


As we can see, the texts with index 3 (Turkish), 7 (Russian), and 8 (Turkish) were removed.
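The join step can be pictured as a plain group-by on the label column. The sketch below is not `textcl`'s implementation; `join_sentences` and the sample rows are made up for illustration. It also shows why a text drops out of the joined result once all of its sentences have been filtered away.

```python
def join_sentences(rows, label_col="index", sentence_col="sentence"):
    """Concatenate the sentences that share a label, in their original order."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row[label_col], []).append(row[sentence_col])
    return {label: " ".join(parts) for label, parts in grouped.items()}

# Hypothetical rows: every sentence of text 1 was filtered out earlier,
# so no row with index 1 is left to join.
rows = [
    {"index": 0, "sentence": "Profits fell."},
    {"index": 0, "sentence": "Shares rose."},
    {"index": 2, "sentence": "The economy grew."},
]
joined = join_sentences(rows)
```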

## Filtering on Jaccard similarity

In [8]:
split_input_texts_df = textcl.jaccard_sim_filtering(split_input_texts_df, threshold=0.8)
print("Num sentences after Jaccard sim filtering: {}".format(len(split_input_texts_df)))

Num sentences after Jaccard sim filtering: 256


In [9]:
split_input_texts_df.head()

Unnamed: 0,index,topic_name,text,sentence
0,0,business,WorldCom bosses' $54m payout Ten former direc...,WorldCom bosses' $54m payout Ten former direc...
1,0,business,WorldCom bosses' $54m payout Ten former direc...,"James Wareham, a lawyer representing one of t..."
2,0,business,WorldCom bosses' $54m payout Ten former direc...,The remaining $36m will be paid by the directo...
3,0,business,WorldCom bosses' $54m payout Ten former direc...,"But, a spokesman for the prosecutor, New York ..."
4,0,business,WorldCom bosses' $54m payout Ten former direc...,Corporate governance experts said that if the...


Join the sentences back into texts to review the results:

In [10]:
textcl.join_sentences_by_label(split_input_texts_df, label_col = 'index')

Unnamed: 0,index,sentence
0,0,WorldCom bosses' $54m payout Ten former direc...
1,1,Profits slide at India's Dr Reddy Profits at ...
2,2,Liberian economy starts to grow The Liberian ...
3,4,Singer Ian Brown 'in gig arrest' Former Stone...
4,5,Blue beat U2 to top France honour Irish band ...
5,6,Housewives lift Channel 4 ratings The debut o...
6,9,Observers to monitor UK election Ministers wi...
7,10,Lib Dems highlight problem debt People vulner...
8,11,Minister defends hunting ban law The law bann...
9,12,Legendary Dutch boss Michels dies Legendary D...


The text with id=17 was removed because it partially duplicates the text with id=18.
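For intuition, Jaccard similarity is the size of the intersection of two sets divided by the size of their union. Here is a minimal sketch on lowercase word sets, assuming whitespace tokenization. The helper names and sample sentences are hypothetical, and which of two near-duplicates survives is a choice made for this sketch (the first one is kept); `textcl` may tokenize and choose differently.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two sentences over their lowercase word sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty sentences are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

def drop_near_duplicates(sentences, threshold=0.8):
    """Keep a sentence only if it is below the threshold against every kept one."""
    kept = []
    for s in sentences:
        if all(jaccard_similarity(s, k) < threshold for k in kept):
            kept.append(s)
    return kept

sentences = [
    "the law banning hunting with dogs came into force today",
    "the law banning hunting with dogs came into force yesterday",
    "profits at the drugmaker fell sharply",
]
unique = drop_near_duplicates(sentences, threshold=0.8)
```

The first two sentences share 9 of 11 distinct words (similarity of about 0.82), so the second one is dropped at a threshold of 0.8.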

## Filtering on perplexity score

In [None]:
split_input_texts_df = textcl.perplexity_filtering(split_input_texts_df, threshold=5)
print("Num sentences after perplexity filtering: {}".format(len(split_input_texts_df)))

In [None]:
split_input_texts_df.head()

Join the sentences back into texts to review the results:

In [None]:
textcl.join_sentences_by_label(split_input_texts_df, label_col = 'index')

The text with id=19 was removed because the sentence `data clear additional 78.0 long-term 43 those)` is not linguistically correct.
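As background, perplexity is the exponential of the average negative log-probability that a language model assigns to the tokens of a sentence, so improbable word sequences score high. The sketch below only illustrates the formula: the token probabilities are made up, and which model `textcl` uses and how it scales the score against `threshold=5` are not stated here.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the tokens."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

# Hypothetical per-token probabilities from some language model.
fluent = perplexity([0.5, 0.5, 0.5, 0.5])    # every token is fairly likely
garbled = perplexity([0.5, 0.01, 0.5, 0.01])  # two very unlikely tokens
```

With these numbers the fluent sentence scores 2 and the garbled one about 14.1, so a cut-off between the two would keep the first and drop the second.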

## Outliers filtering

Go back to one row per text after filtering and select the **tech** category. An outlier was manually inserted into this category: a person profile instead of a tech text.

In [None]:
joined_texts = split_input_texts_df[["index", "text", "topic_name"]].drop_duplicates()
joined_texts = joined_texts[joined_texts.topic_name == 'tech']

In [None]:
joined_texts, _ = textcl.outlier_detection(joined_texts, method='rpca', Z_threshold=0.8)
print("Num texts after outliers filtering: {}".format(len(joined_texts)))

In [None]:
joined_texts

The text with id=20 was removed because it describes a person profile instead of tech news.
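The `Z_threshold` argument suggests a cut-off on standardized scores. As a rough illustration only, the sketch below assumes each text has already been given a numeric anomaly score and keeps the texts whose z-score stays within the threshold. This is not the RPCA step itself, the scores are made up, and how `textcl` derives and thresholds its scores may differ.

```python
import statistics

def keep_by_z_score(scores, z_threshold=0.8):
    """Return the positions whose absolute z-score is within the threshold."""
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    if std == 0:
        return list(range(len(scores)))  # all scores equal: nothing stands out
    return [i for i, s in enumerate(scores) if abs(s - mean) / std <= z_threshold]

# Hypothetical anomaly scores for five texts; the last one stands out.
scores = [1.0, 1.2, 0.9, 1.1, 5.0]
kept = keep_by_z_score(scores, z_threshold=0.8)
```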