<span>
<img src="./logo.png" width="100px" align="right"/>
</span>
<span>
<b>Author:</b> <a href="https://github.com/andreafailla">Andrea Failla</a><br/>
<b>Python version:</b>  >=3<br/>
<b>Last update:</b> 08/08/2021<br>
<b>Icon:</b> Smashicons
</span>

<a id='top'></a>
# PreTwITA
<code><b>PreTwITA</b></code> is an open-source <b>Pre</b>processor for <b>Tw</b>eets in the <b>ITA</b>lian language, written in Python. It provides language-specific tools for text cleaning, i.e. the process of preparing raw text for Natural Language Processing. This notebook illustrates the main features of <code>PreTwITA</code>.

First, let's import the module

In [1]:
from pretwita import PreTwITA as PTW
from pretwita.patterns import get_hashtags

Then, we create a <code>PTW</code> instance and pass it our text

In [2]:
tweet = 'Questo è un tweet di prova 😀😀😀 @unipisa'

In [3]:
ptw = PTW(tweet)
ptw

<pretwita.pretwita.PreTwITA at 0x7f9a474a6370>

### Functions ([to top](#top))

<code>PreTwITA</code> comes with an <code>available_functions()</code> method for quick reference. Its output lists the functions along with their default parameters

In [4]:
PTW.available_functions()

clean(placeholder=False, additional_stopwords=None, keep_dates=False)
to_lower()
correct_abbreviations()
remove_urls(placeholder=False)
remove_emojis(placeholder=False)
remove_emoticons(placeholder=False)
remove_mentions(placeholder=False)
remove_hashtags(placeholder=False)
remove_reserved_words(placeholder=False)
remove_stopwords(additional_stopwords=None)
remove_punctuation()
remove_numbers(keep_dates=False)
remove_multiple_spaces()
get_tokens()


To remove unwanted elements, just call the corresponding method. When <code>placeholder</code> is <code>True</code>, such elements are replaced with dummy tokens instead of being removed

In [5]:
ptw.remove_mentions()
print(ptw.text)

Questo è un tweet di prova 😀😀😀 


In [6]:
ptw.remove_emojis(placeholder=True)
print(ptw.text)

Questo è un tweet di prova xxEMOJIxx xxEMOJIxx xxEMOJIxx  
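
The effect of <code>placeholder</code> can be pictured with a small regular-expression sketch. This is not PreTwITA's actual implementation: the helper name <code>remove_mentions_sketch</code>, the pattern <code>@\w+</code>, and the token <code>xxMENTIONxx</code> (modelled on the <code>xxEMOJIxx</code> token in the output above) are assumptions made for illustration.

```python
import re

# Assumed pattern: an "@" followed by letters, digits or underscores
MENTION_PATTERN = re.compile(r"@\w+")

def remove_mentions_sketch(text, placeholder=False):
    """Drop mentions, or swap them for a dummy token (hypothetical helper)."""
    # "xxMENTIONxx" is a guessed token name, not taken from the library
    replacement = "xxMENTIONxx" if placeholder else ""
    return MENTION_PATTERN.sub(replacement, text)

removed = remove_mentions_sketch("ciao @unipisa")
replaced = remove_mentions_sketch("ciao @unipisa", placeholder=True)
```

Keeping a placeholder records that an element was present, which can be useful when the cleaned text is later turned into features.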


The <code>correct_abbreviations()</code> function replaces common Italian abbreviations (often used in online contexts) with the corresponding full expression. To increase accuracy, convert the text to lowercase first using <code>to_lower()</code>

In [7]:
tweet = "Nn so xk xò vbn"

In [8]:
ptw = PTW(tweet)
ptw.to_lower().correct_abbreviations()
ptw.text

'non so perché però va bene'
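
One way to see why lowercasing first helps is a plain dictionary lookup, sketched below. The mapping contains only the four abbreviations from the example above, and the helper name <code>expand_abbreviations_sketch</code> is hypothetical; the library's own list and matching logic may differ.

```python
# Tiny illustrative mapping taken from the example above;
# the library's own list is presumably larger
ABBREVIATIONS = {"nn": "non", "xk": "perché", "xò": "però", "vbn": "va bene"}

def expand_abbreviations_sketch(text):
    """Replace each whitespace-separated token found in the mapping (hypothetical helper)."""
    return " ".join(ABBREVIATIONS.get(token, token) for token in text.split())

raw = "Nn so xk xò vbn"
# "Nn" does not match the lowercase key "nn", so it is left untouched
without_lowercasing = expand_abbreviations_sketch(raw)
# After lowercasing, every abbreviation in the mapping is expanded
with_lowercasing = expand_abbreviations_sketch(raw.lower())
```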

The function <code>remove_stopwords()</code> removes the Italian stopwords listed in <code>nltk.corpus.stopwords</code>. Since these are hard-coded, there is no need to have <code>nltk</code> installed. The function takes an optional parameter <code>additional_stopwords</code> that lets the user supply extra stopwords as a <code>list</code>

In [9]:
tweet = "Non so perché però questo tweet va bene"

In [10]:
ptw = PTW(tweet)
ptw.to_lower().remove_stopwords()
ptw.text

'so però tweet va bene'

In [11]:
stopwords = ['so', 'però', 'va']
ptw.to_lower().remove_stopwords(additional_stopwords=stopwords)
ptw.text

'tweet bene'
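
The interplay between the built-in list and <code>additional_stopwords</code> can be sketched with a set union. The base set below is a hand-picked subset (just the three words removed in the example above), and <code>remove_stopwords_sketch</code> is a hypothetical helper, not the library's code.

```python
# Small hand-picked subset for illustration only
BASE_STOPWORDS = {"non", "perché", "questo"}

def remove_stopwords_sketch(text, additional_stopwords=None):
    """Drop tokens found in the base set or in the user-supplied list (hypothetical helper)."""
    stopwords = BASE_STOPWORDS | set(additional_stopwords or [])
    return " ".join(token for token in text.split() if token not in stopwords)

text = "non so perché però questo tweet va bene"
default_result = remove_stopwords_sketch(text)
extended_result = remove_stopwords_sketch(text, additional_stopwords=["so", "però", "va"])
```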

The function <code>remove_numbers()</code> takes an optional parameter <code>keep_dates</code>. When it is <code>True</code>, years in the four-digit format <i>yyyy</i> are kept while all other numbers are removed
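
A rough sketch of this behaviour with the standard <code>re</code> module follows. It treats every standalone four-digit number as a year, which is an assumption; the library may apply a stricter rule. The helper name <code>remove_numbers_sketch</code> is hypothetical.

```python
import re

def remove_numbers_sketch(text, keep_dates=False):
    """Remove standalone numbers, optionally keeping four-digit ones (hypothetical helper)."""
    def _replace(match):
        token = match.group(0)
        # Assumption: any four-digit number counts as a year
        if keep_dates and len(token) == 4:
            return token
        return ""
    without_numbers = re.sub(r"\b\d+\b", _replace, text)
    # Collapse the double spaces left behind by the removed numbers
    return re.sub(r" {2,}", " ", without_numbers).strip()

sentence = "Nel 2012 avevo 3 cani, 2 gatti e 8 porcellini d'india"
years_kept = remove_numbers_sketch(sentence, keep_dates=True)
all_removed = remove_numbers_sketch(sentence)
```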

In [12]:
tweet = "Nel 2012 avevo 3 cani, 2 gatti e 8 porcellini d'india"
ptw = PTW(tweet)

In [13]:
ptw.remove_numbers(keep_dates=True)
ptw.text

"Nel 2012 avevo cani, gatti e porcellini d'india"

In some cases, PreTwITA functions leave a blank space when removing elements from tweets. This is a safety measure that keeps each word separated from its neighbours. If you do not want these additional spaces, call <code>remove_multiple_spaces()</code>

In [14]:
tweet = "Ho    bisogno dei     miei  spazi"
ptw = PTW(tweet)

In [15]:
ptw.remove_multiple_spaces()
ptw.text

'Ho bisogno dei miei spazi'
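
The reasoning behind the extra space can be illustrated with plain string operations. The snippet below is a minimal sketch that uses only the standard library and does not reproduce PreTwITA's internals.

```python
import re

text = "bella😀giornata  oggi"

# Deleting the emoji outright glues the neighbouring words together
glued = text.replace("😀", "")

# Replacing it with a space keeps the words apart, at the cost of extra spaces
spaced = text.replace("😀", " ")

# Collapsing runs of whitespace afterwards tidies the result
tidy = re.sub(r"\s+", " ", spaced).strip()
```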

### Building a pipeline ([to top](#top))

A preprocessing pipeline can be built by chaining method calls, since each method returns the <code>PTW</code> object itself

In [16]:
tweet = """Questo testo serve per testare le funzioni di pretwita, che sono addirittura più di 2!
Rilancia questo tweet con un RT grz tvb 😀😀😀
Chissà cosa ne penseranno a @Unipisa... #nlp"""
ptw = PTW(tweet)

In [17]:
ptw.to_lower().remove_mentions() \
.remove_hashtags().remove_reserved_words() \
.remove_stopwords().remove_multiple_spaces()

<pretwita.pretwita.PreTwITA at 0x7f9a474e59d0>

In [18]:
ptw.text

'testo serve testare funzioni pretwita, addirittura 2! rilancia tweet grz tvb 😀😀😀 chissà cosa penseranno ...'
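
Chaining works because each call returns the object it was called on, as the <code>PreTwITA</code> output of cell 17 shows. The toy class below sketches that pattern; <code>MiniCleaner</code> and its regular expressions are hypothetical stand-ins, not the library's code.

```python
import re

class MiniCleaner:
    """Toy cleaner illustrating method chaining (hypothetical, not PreTwITA)."""

    def __init__(self, text):
        self.text = text

    def to_lower(self):
        self.text = self.text.lower()
        return self  # returning self is what makes chaining possible

    def remove_mentions(self):
        # Assumed pattern; replace with a space so words stay separated
        self.text = re.sub(r"@\w+", " ", self.text)
        return self

    def remove_multiple_spaces(self):
        self.text = re.sub(r"\s+", " ", self.text).strip()
        return self

cleaner = MiniCleaner("Ciao   @Unipisa come va")
result = cleaner.to_lower().remove_mentions().remove_multiple_spaces().text
```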

You can also call the <code>clean()</code> function to run a full cleaning pipeline

In [19]:
tweet = """Questo testo serve per testare le funzioni di pretwita, che sono addirittura più di 2!
Rilancia questo tweet con un RT grz tvb 😀😀😀
Chissà cosa ne penseranno a @Unipisa... #nlp"""
ptw = PTW(tweet)

In [20]:
ptw.clean().text

'testo serve testare funzioni pretwita addirittura rilancia tweet grazie voglio bene chissà cosa penseranno'
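
In the output above, <code>tvb</code> ends up as <code>voglio bene</code>, which suggests that abbreviations are expanded before stopwords are removed. The sketch below shows why the order of steps matters. It assumes that <code>tvb</code> expands to "ti voglio bene" and that "ti" is a stopword; these assumptions, the helper names, and the ordering itself are illustrative and not taken from the library's source.

```python
from functools import reduce

ABBREVIATIONS = {"grz": "grazie", "tvb": "ti voglio bene"}  # assumed mapping
STOPWORDS = {"ti"}  # tiny illustrative subset

def expand_abbreviations(text):
    return " ".join(ABBREVIATIONS.get(token, token) for token in text.split())

def drop_stopwords(text):
    return " ".join(token for token in text.split() if token not in STOPWORDS)

def run_pipeline(text, steps):
    """Apply each step to the output of the previous one."""
    return reduce(lambda current, step: step(current), steps, text)

# Expanding first exposes "ti" to the stopword filter
expanded_first = run_pipeline("grz tvb", [expand_abbreviations, drop_stopwords])
# Filtering first misses the "ti" hidden inside "tvb"
stopwords_first = run_pipeline("grz tvb", [drop_stopwords, expand_abbreviations])
```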

In [21]:
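from IPython.display import HTML, display
import requests

def show_tweet(link):
    """Display the link and the embedded contents of a tweet."""
    # Ask Twitter's oEmbed endpoint for the embeddable HTML of the tweet;
    # passing the link via `params` lets requests URL-encode it
    response = requests.get('https://publish.twitter.com/oembed', params={'url': link})
    response.raise_for_status()  # fail early on HTTP errors
    html = response.json()["html"]
    display(link)  # use the function argument, not the global `tweet_link`
    display(HTML(html))

tweet_link = "https://twitter.com/matteosalvinimi/status/1423957002540228621"
show_tweet(tweet_link)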

'https://twitter.com/matteosalvinimi/status/1423957002540228621'