# Data Science Design Manual Chapter 3 Exercises

This chapter focuses on "Data Munging" (AKA "Data Wrangling"), i.e. obtaining data from a relevant source and transforming it into a structured dataset that can be analyzed with a chosen set of tools (e.g. particular programming languages, analytical paradigms, applications).

## Implementation Projects

---

### 3-10

*Implement a function that extracts the set of hashtags from a data frame of tweets. Hashtags begin with the “#” character and contain any combination of upper and lowercase characters and digits. Assume the hashtag ends where there is a space or a punctuation mark, like a comma, semicolon, or period.*


Starting from first principles, we would need a way to import tweets (find a source and request them) and then represent them in data frame format.

For our purposes, we may suppose that each tweet is simply a string of at most 280 characters, and that a data frame is just a Python list of such strings (no need to use pandas right now...).

```python
import string

# Characters allowed in the body of a hashtag: ASCII letters and digits.
HASHTAG_CHARS = set(string.ascii_letters + string.digits)


class TweetFrame:
    def __init__(self, tweets):
        self.tweets = tweets

    def getHashtags(self):
        hashtags = set()
        for tweet in self.tweets:
            body = None  # None means we are not currently inside a hashtag
            # A trailing space acts as a sentinel so that a hashtag at the very
            # end of the tweet is closed like any other.
            for char in tweet + ' ':
                if body is not None and char in HASHTAG_CHARS:
                    body += char
                    continue
                if body:
                    # since hashtags are not case-sensitive, we normalize them
                    # to lower-case in order to avoid redundancy
                    hashtags.add('#' + body.lower())
                # A '#' opens a new hashtag; any other character closes the current one.
                body = '' if char == '#' else None
        return hashtags
```

We should now have a method, `getHashtags`, that works on any iterable of tweets and returns the set of all hashtags found in them, normalized to lower case. Let's test it:

```python
tweetList = [
    "wow, I can't believe how lit #Jupyter notebooks are #JupyterNotebooks#Python",
    "#IPython notebooks are #awesomesauce, can't believe I haven't used them before#crying",
    "bruh these #JupyterNotebooks are great, I love them #jupyter #majorkeyalert ## # #",
]

MyTweets = TweetFrame(tweets=tweetList)

print(MyTweets.getHashtags())
```

{'#crying', '#python', '#jupyter', '#ipython', '#awesomesauce', '#jupyternotebooks', '#majorkeyalert'}
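
As a cross-check on the character-by-character scan, the same extraction can be sketched with a regular expression. This is only an alternative under the exercise's stated assumptions (a hashtag is `#` followed by one or more ASCII letters or digits); the names `HASHTAG_PATTERN`, `extract_hashtags`, and `sample` are illustrative and do not appear in the solution above.

```python
import re

# Assumption: a hashtag is '#' followed by one or more ASCII letters or digits.
HASHTAG_PATTERN = re.compile(r"#([A-Za-z0-9]+)")


def extract_hashtags(tweets):
    """Return the set of lower-cased hashtags found in an iterable of tweet strings."""
    hashtags = set()
    for tweet in tweets:
        # findall returns only the captured body, without the leading '#'.
        for body in HASHTAG_PATTERN.findall(tweet):
            hashtags.add("#" + body.lower())
    return hashtags


# Hypothetical edge cases: adjacent hashtags, trailing punctuation, bare '#' marks.
sample = ["so #Lit#Python!", "ends with punctuation #Done.", "stray marks ## # #"]
print(sorted(extract_hashtags(sample)))
# ['#done', '#lit', '#python']
```

Sorting before printing makes the output order deterministic, since the iteration order of a Python set of strings can vary between runs.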


---