## Texthero

Under the hood, Texthero builds on several NLP and machine learning toolkits, including Gensim, NLTK, spaCy and scikit-learn. You don't need to install them separately; pip pulls them in as dependencies.

Texthero includes tools for:

- Text preprocessing: out-of-the-box solutions that remain flexible enough for custom pipelines.
- Natural Language Processing: keyphrase and keyword extraction, and named entity recognition.
- Text representation: TF-IDF, term frequency, and custom word embeddings (wip).
- Vector space analysis: clustering (K-means, Meanshift, DBSCAN and Hierarchical), topic modelling (wip) and interpretation.
- Text visualization: vector space visualization, place localization on maps (wip).

Supported representation algorithms:

- Term frequency (count)
- Term frequency-inverse document frequency (tfidf)
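
To make the two representations concrete, here is a minimal sketch in plain Python. It is not Texthero's implementation: the toy documents and the `tfidf` helper are made up for illustration, and it uses the textbook weighting tf × log(N / df). Libraries such as scikit-learn typically apply smoothed variants, so their exact numbers will differ.

```python
import math
from collections import Counter

# Three toy documents (hypothetical, not taken from the BBC Sport dataset).
docs = [
    "the match ended in a draw",
    "the player won the match",
    "the record was broken",
]

tokenized = [doc.split() for doc in docs]

# Term frequency: raw count of each term per document.
term_counts = [Counter(tokens) for tokens in tokenized]

# Document frequency: number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def tfidf(term, doc_index):
    """Textbook tf * log(N / df); only valid for terms seen in the corpus."""
    tf = term_counts[doc_index][term]
    idf = math.log(len(docs) / doc_freq[term])
    return tf * idf


print(term_counts[1]["the"])         # 2: "the" occurs twice in the second document
print(tfidf("the", 1))               # 0.0: "the" is in every document, so idf = log(1)
print(round(tfidf("player", 1), 3))  # 1.099: "player" is in one of three documents
```

A very common word such as "the" has a high count but a TF-IDF weight of zero, which is why TF-IDF is usually the better input for clustering.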

Supported clustering algorithms:

- K-means (kmeans)
- Density-Based Spatial Clustering of Applications with Noise (dbscan)
- Meanshift (meanshift)

Supported dimensionality reduction algorithms:

- Principal component analysis (pca)
- t-distributed stochastic neighbor embedding (tsne)
- Non-negative matrix factorization (nmf)

In [118]:
import warnings
warnings.filterwarnings("ignore")

In [50]:
!pip install texthero



In [137]:
import texthero
help(texthero)

Help on package texthero:

NAME
    texthero - Texthero: python toolkit for text preprocessing, representation and visualization.

PACKAGE CONTENTS
    __about__
    clustering
    preprocessing
    representation
    statistics
    texthero
    version
    visualization

DATA
    stopwords = <WordListCorpusReader in 'C:\\Users\\Abhra\\AppData\\Roami...

VERSION
    1.0.5

FILE
    c:\users\abhra\anaconda3\lib\site-packages\texthero\__init__.py




#### Text Preprocessing

In [140]:
import pandas as pd
text="Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years to archeological discoveries in the Middle East. It is a two player game where each player has (fifteen) checkers which move between 24 points according to the roll of two dice."
series=pd.Series(text)

In [141]:
series

0    Backgammon is one of the oldest known board ga...
dtype: object

In [142]:
import texthero as hero

# Remove digits
hero.remove_digits(series)

0    Backgammon is one of the oldest known board ga...
dtype: object

In [143]:
# Remove punctuation
hero.remove_punctuation(series)

0    Backgammon is one of the oldest known board ga...
dtype: object

In [144]:
# Remove brackets
hero.remove_brackets(series)

0    Backgammon is one of the oldest known board ga...
dtype: object

In [145]:
# Remove diacritics (accents)
hero.remove_diacritics(series)

0    Backgammon is one of the oldest known board ga...
dtype: object

In [146]:
# Remove extra whitespace
hero.remove_whitespace(series)

0    Backgammon is one of the oldest known board ga...
dtype: object

In [147]:
# Remove stop words
hero.remove_stop_words(series)

0    Backgammon  one   oldest known board games. It...
dtype: object

In [148]:
# Apply the default cleaning pipeline
hero.clean(series)

0    backgammon one oldest known board games histor...
dtype: object
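
The cleaning steps above can be approximated with the standard library alone. The sketch below is an assumption about what a default pipeline roughly does (lowercase, drop digits, punctuation and diacritics, remove stop words, collapse whitespace). It is not Texthero's actual code, and both `clean_text` and the short `STOP_WORDS` set are hypothetical.

```python
import re
import string
import unicodedata

# Tiny illustrative stop-word list; a real list is much longer.
STOP_WORDS = {"is", "of", "the", "its", "can", "be", "to", "in", "a", "it"}

# Map every ASCII punctuation character to a space so words stay separated.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text):
    """Hypothetical helper approximating a default cleaning pipeline."""
    text = text.lower()
    # Drop runs of digits.
    text = re.sub(r"\d+", " ", text)
    # Replace punctuation with spaces.
    text = text.translate(PUNCT_TO_SPACE)
    # Strip diacritics: decompose characters, then discard combining marks.
    text = "".join(
        ch for ch in unicodedata.normalize("NFKD", text)
        if not unicodedata.combining(ch)
    )
    # Remove stop words; split/join also collapses repeated whitespace.
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


sample = (
    "Backgammon is one of the oldest known board games. "
    "Its history can be traced back nearly 5,000 years."
)
print(clean_text(sample))
# backgammon one oldest known board games history traced back nearly years
```

The result starts the same way as the `hero.clean` output shown above, although the two are not guaranteed to agree on every input.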

In [149]:
df = pd.read_csv(
   "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)
df.head()

Unnamed: 0,text,topic
0,Claxton hunting first major medal\n\nBritish h...,athletics
1,O'Sullivan could run in Worlds\n\nSonia O'Sull...,athletics
2,Greene sets sights on world title\n\nMaurice G...,athletics
3,IAAF launches fight against drugs\n\nThe IAAF ...,athletics
4,"Dibaba breaks 5,000m world record\n\nEthiopia'...",athletics


In [150]:
# PCA
import texthero as hero
import pandas as pd

df = pd.read_csv(
   "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

df['pca'] = (
   df['text']
   .pipe(hero.clean)
   .pipe(hero.do_tfidf)  # vectorize the cleaned text
   .pipe(hero.do_pca)
)
hero.scatterplot(df, 'pca', color='topic', title="PCA BBC Sport news")
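
`do_pca` reduces each TF-IDF vector to a pair of coordinates so the documents can be plotted. As a rough illustration of the underlying idea, the sketch below finds only the first principal component of a tiny 2-D dataset by power iteration on the covariance matrix. The function name and the data are hypothetical, and production libraries typically use an SVD-based solver rather than this loop.

```python
import math


def first_principal_component(rows, n_iter=50):
    """Return (unit direction, 1-D projections) for the top principal component."""
    n, d = len(rows), len(rows[0])
    # Center every column on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly apply the covariance matrix and renormalize.
    v = [1.0] * d
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered row onto the component.
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projections


# Points lying exactly on the line y = x / 2 (made-up data).
rows = [(2.0, 1.0), (4.0, 2.0), (6.0, 3.0)]
component, projections = first_principal_component(rows)
print([round(x, 3) for x in component])    # [0.894, 0.447]
print([round(x, 3) for x in projections])  # [-2.236, 0.0, 2.236]
```

The component points along the direction in which the data varies most, and the projections are the new one-dimensional coordinates. A two-component PCA repeats the idea for the next orthogonal direction.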

In [151]:
df

Unnamed: 0,text,topic,pca
0,Claxton hunting first major medal\n\nBritish h...,athletics,"[0.021837064492021573, -0.25498467778003636]"
1,O'Sullivan could run in Worlds\n\nSonia O'Sull...,athletics,"[-0.1078507980789952, 0.012807528242805755]"
2,Greene sets sights on world title\n\nMaurice G...,athletics,"[0.04752140853536325, -0.24468334905976996]"
3,IAAF launches fight against drugs\n\nThe IAAF ...,athletics,"[0.025562860091105987, -0.2283714483856848]"
4,"Dibaba breaks 5,000m world record\n\nEthiopia'...",athletics,"[-0.018278790528479666, -0.21534703599357619]"
...,...,...,...
732,Agassi into second round in Dubai\n\nFourth se...,tennis,"[-0.03931431828575104, -0.23077802100878952]"
733,Mauresmo fights back to win title\n\nWorld num...,tennis,"[-0.04133748439698744, -0.3684613853914394]"
734,Federer wins title in Rotterdam\n\nWorld numbe...,tennis,"[-0.055367724732986026, -0.2622211584437157]"
735,GB players warned over security\n\nBritain's D...,tennis,"[0.11174055041341721, 0.030208444921598243]"


In [152]:
import texthero as hero
import pandas as pd

df = pd.read_csv(
    "https://github.com/jbesomi/texthero/raw/master/dataset/bbcsport.csv"
)

df['tfidf'] = (
    df['text']
    .pipe(hero.clean)
    .pipe(hero.do_tfidf)
)
# K-means

df['kmeans_labels'] = (
    df['tfidf']
    .pipe(hero.do_kmeans, n_clusters=5)
    .astype(str)
)

df['pca'] = df['tfidf'].pipe(hero.do_pca)

hero.scatterplot(df, 'pca', color='kmeans_labels', title="K-means BBC Sport news")
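
For intuition about what the clustering step is doing, here is a minimal sketch of Lloyd's algorithm, the standard K-means iteration, in plain Python. It is not the code behind `do_kmeans`: the `kmeans` function, the fixed starting centroids and the six sample points are assumptions chosen to keep the example deterministic.

```python
import math


def kmeans(points, centroids, n_iter=10):
    """Lloyd's algorithm with caller-supplied starting centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(n_iter):
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[k] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Two well-separated groups of made-up 2-D points.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In the notebook cell above, the same kind of labels are computed on the TF-IDF vectors with `n_clusters=5` and cast to strings, which lets the scatter plot treat them as categories rather than a numeric scale.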