## Context
The aim of this project is to understand how people talk about healthy living on Instagram. What is associated with the idea of healthy living?
I used a tool called Apify to scrape Instagram posts with the hashtags #healthyliving, #healthylifestyle and #wellnessjourney. 

In this Jupyter notebook, the CSV files containing the posts are cleaned and standardized.



### Cleaning and exploring the data


In [1]:
import pandas as pd
import re

In [2]:
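# show the full text of each post instead of truncating long cells
# (None replaces the deprecated value -1, which newer pandas versions reject)
pd.set_option('display.max_colwidth', None)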

pd.set_option('display.max_rows', None)

In [3]:
df1 = pd.read_csv("/Users/ana/ironhack_coding/projects/instagram-topic-prediction/datasets/healthyliving.csv")
df2 = pd.read_csv("/Users/ana/ironhack_coding/projects/instagram-topic-prediction/datasets/healthylifestyle.csv")
df3 = pd.read_csv("/Users/ana/ironhack_coding/projects/instagram-topic-prediction/datasets/wellnessjourney.csv")
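
The absolute paths above only resolve on the author's machine. As a minimal sketch, assuming the three CSV files sit together in one folder, the loader below builds each path from a single base directory. `DATA_DIR`, `HASHTAG_FILES`, `load_posts` and the `source_hashtag` column are illustrative names, not part of the original notebook or the Apify export.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location: point this at the folder that holds the scraped CSVs.
DATA_DIR = Path("datasets")
HASHTAG_FILES = ["healthyliving", "healthylifestyle", "wellnessjourney"]


def load_posts(data_dir, names):
    """Read one CSV per hashtag and tag each row with the hashtag it came from."""
    frames = []
    for name in names:
        frame = pd.read_csv(Path(data_dir) / f"{name}.csv")
        # Illustrative extra column, not present in the scraped data.
        frame["source_hashtag"] = name
        frames.append(frame)
    return frames
```

With the folder set correctly, `df1, df2, df3 = load_posts(DATA_DIR, HASHTAG_FILES)` would give the same three DataFrames, each with one extra column recording which hashtag search produced the row.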


In [None]:
df1.shape

In [None]:
df2.shape

In [None]:
df3.shape

In [None]:
df2.tail()

#### Checking Null Values

In [None]:

df1.isnull().sum()

In [None]:
df2.isnull().sum()

In [None]:
df3.isnull().sum()

#### Dropping columns

In [None]:
# these columns are not needed for the text analysis
cols_to_drop = ["alt", "locationName", "ownerUsername", "imageUrl", "url"]
for df in (df1, df2, df3):
    df.drop(columns=cols_to_drop, inplace=True)

#### Concatenating the datasets

In [None]:
# stack the three datasets; ignore_index gives every row a unique label
instagram = pd.concat([df1, df2, df3], axis=0, ignore_index=True)

In [None]:
# types
instagram.dtypes

In [None]:
instagram.isnull().sum()

In [None]:
instagram.shape

#### Removing NaN rows

In [None]:
instagram.dropna(subset=["firstComment"], inplace = True)

In [None]:
instagram.isnull().sum()

In [None]:
instagram.shape

#### Creating a column for hashtags

In [None]:
def get_hashtag(s):
    # match on the original text: stripping spaces first would merge a hashtag with the word after it
    return re.findall(r"#([a-z]+)", s, flags=re.IGNORECASE)

In [None]:
instagram["hashtags"] = instagram["firstComment"].apply(str).apply(get_hashtag)
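
To see how hashtag extraction behaves on a concrete caption, here is a small standalone sketch using only `re`. The sample caption, the `HASHTAG_RE` pattern and the `extract_hashtags` helper are made up for illustration. Accepting digits and underscores and lowercasing the tags are assumptions about what should count as the same hashtag, not something the scraped data dictates.

```python
import re

# Letters, digits and underscores after '#': an assumption about what counts as a tag.
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(caption):
    """Return the lowercased hashtags found in a caption, without the leading '#'."""
    return [tag.lower() for tag in HASHTAG_RE.findall(caption)]


sample = "Morning run done! #HealthyLiving feeling great #yoga_2024"
print(extract_hashtags(sample))  # ['healthyliving', 'yoga_2024']

# Removing spaces before matching glues a tag to the words that follow it.
glued = re.findall(r"#([a-z]+)", sample.replace(" ", ""), flags=re.IGNORECASE)
print(glued)  # ['HealthyLivingfeelinggreat', 'yoga']
```

Lowercasing at extraction time means `#HealthyLiving` and `#healthyliving` are counted as one tag later on.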

#### Removing hashtags from the post text

In [None]:
# lowercase the text, then strip hashtags so only the caption wording remains
instagram["post"] = instagram["firstComment"].apply(str).str.lower().str.replace(r"#[a-z]+", "", regex=True)

#### Removing duplicate posts

In [None]:
instagram.drop_duplicates(subset="post", inplace= True)

In [None]:
instagram.shape

#### Removing non-letter characters (e.g. emojis)

In [None]:
instagram["post"] = instagram["post"].apply(str).apply(lambda x: re.sub(r'[^A-Za-z ]', "", x))

In [None]:
instagram["post"].head(30)
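
The cleaning steps so far (lowercasing, dropping hashtags, dropping non-letters) can be sketched as one plain-Python function. This is a variation rather than the notebook's exact method: it replaces each run of non-letters with a space instead of deleting it, so that words separated only by a newline or punctuation are not merged, and it then collapses the leftover whitespace. The `clean_caption` name and the sample caption are illustrative.

```python
import re


def clean_caption(caption):
    """Lowercase, drop hashtags, keep only ASCII letters, and tidy whitespace."""
    text = caption.lower()
    # Remove hashtags (letters, digits and underscores after '#').
    text = re.sub(r"#\w+", "", text)
    # Turn every run of non-letters (digits, punctuation, emojis, newlines) into a space.
    text = re.sub(r"[^a-z]+", " ", text)
    # Collapse repeated spaces and trim the ends.
    return " ".join(text.split())


print(clean_caption("Smoothie time 🍓🥬 #HealthyLiving  100% plant-based!"))
# smoothie time plant based
```

Whether `plant-based` should become `plant based` or `plantbased` depends on the later analysis; the space-preserving choice here is only one option.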

#### List of hashtags

In [None]:
hashtags = instagram["hashtags"].explode()

In [None]:
hashtags1 = hashtags.value_counts().head(25).index.tolist()
hashtags1
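
The `explode` and `value_counts` steps flatten the per-post hashtag lists and count how often each tag occurs. The same idea in plain Python, on made-up data standing in for the `hashtags` column, looks roughly like this (`hashtag_lists`, `counts` and `top` are illustrative names):

```python
from collections import Counter

# Made-up rows standing in for the per-post hashtag lists.
hashtag_lists = [
    ["healthyliving", "yoga"],
    ["yoga", "healthyliving", "mealprep"],
    [],
    ["yoga"],
]

# Flatten the lists and count each tag, as explode() + value_counts() do.
counts = Counter(tag for tags in hashtag_lists for tag in tags)

# The two most frequent tags with their counts.
top = counts.most_common(2)
print(top)  # [('yoga', 3), ('healthyliving', 2)]
```

Keeping the counts alongside the tag names, rather than the names alone, preserves the information about which hashtags dominate.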

### Visualisation

In [None]:
import numpy as np
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
text = ' '.join(hashtags1)

In [None]:
text

In [None]:

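# size each hashtag by how often it occurs; the joined string `text` contains every tag only once
top_counts = hashtags.value_counts().head(25).to_dict()
wordcloud = WordCloud(max_font_size=50, max_words=100, background_color="white").generate_from_frequencies(top_counts)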

plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()