# Analysing User Behaviour on Instagram

The original datasets were provided by HiCounselor (https://www.linkedin.com/company/hicounselor/) as part of the online "Analysing User Behaviour on Instagram" project.

The primary aim of this part of the project is to preprocess and clean the datasets to prepare them for analysis using SQL.

In [1]:
#importing libraries
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings("ignore")

In [2]:
#reading the files into a dictionary of DataFrames, keyed by file name without the extension
list_of_file_names = ["comments.csv", "follows.csv", "likes.csv", "photo_tags.csv", "photos.csv", "tags.csv", "users.csv"]
data = {}

for name in list_of_file_names:
    df = pd.read_csv('original-source/{}'.format(name))
    key_name = name.replace(".csv", "")
    data[key_name] = df

## Exploring the dataset

In [3]:
for file in data:
    print("Name of the File: {}".format(file))
    print(data[file].head(5))
    print("\n")

Name of the File: comments
   id                 comment  User  id  Photo id created Timestamp   
0   1         unde at dolorem         2         1  13-04-2023 08:04  \
1   2         quae ea ducimus         3         1  13-04-2023 08:04   
2   3      alias a voluptatum         5         1  13-04-2023 08:04   
3   4    facere suscipit sunt        14         1  13-04-2023 08:04   
4   5  totam eligendi quaerat        17         1  13-04-2023 08:04   

  posted date emoji used  Hashtags used count  
0    April 14        yes                    1  
1    April 14         no                    2  
2    April 14         no                    4  
3    April 14        yes                    2  
4    April 14        yes                    1  


Name of the File: follows
   follower  followee       created time  is follower active   
0         2          1  13-04-2023 08:04                   1  \
1         2          3  13-04-2023 08:04                   0   
2         2          4  13-04-2023 08:

## Cleaning the dataset
In this project, data cleaning means removing irrelevant or redundant columns and renaming the remaining columns so that they are descriptive and consistent across the datasets. This improves the quality and usability of the data for analysis and modeling.

In [4]:
#The list of columns to remove or rename was provided by the project organizer as part of the task

names_to_drop = {'comments':{'drop':["posted date","emoji used","Hashtags used count"],
                            'rename':{'comment':'comment_text', 'User  id':'user_id',
                                      'Photo id': 'photo_id', 'created Timestamp':'created_at'}},
                'follows':{'drop':["is follower active","followee Acc status"],
                            'rename':{'follower':'follower_id','followee ':'followee_id','created time':'created_at'}},
                 'likes':{'drop':["following or not","like type"],
                            'rename':{'user ':'user_id','photo':'photo_id','created time':'created_at'}},
                 'photo_tags':{'drop':["user id"],
                            'rename':{'photo':'photo_id','tag ID':'tag_id'}},
                 'photos':{'drop':["Insta filter used","photo type"],
                            'rename':{'image link':'image_url','user ID':'user_id','created dat':'created_date'}},
                 'tags':{'drop':["location"],
                            'rename':{'tag text':'tag_name','created time':'created_at'}},
                 'users':{'drop':["private/public","post count","Verified status"],
                            'rename':{'name':'username','created time':'created_at'}}}

#for each dataset, drop the unwanted columns first, then rename the remaining ones
for key, spec in names_to_drop.items():
    data[key].drop(columns=spec['drop'], inplace=True)
    data[key].rename(columns=spec['rename'], inplace=True)
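
One thing to keep in mind with the cell above: `DataFrame.drop` raises a `KeyError` for a missing column by default, but `DataFrame.rename` silently skips keys that do not match any column. Headers in these files contain irregular spacing (for example `User  id` with two spaces), so a mistyped key would go unnoticed. The helper below is a minimal sketch of a check that could run before the cleaning loop. `missing_columns` and `typo_spec` are hypothetical names introduced for illustration, and the column list is copied from the `comments` output printed above.

```python
def missing_columns(columns, spec):
    """Return the column names in spec (to drop or rename) that are absent from columns."""
    present = set(columns)
    wanted = list(spec.get("drop", [])) + list(spec.get("rename", {}))
    return [name for name in wanted if name not in present]

# Column names of comments.csv, copied from the head() output above
comments_columns = ["id", "comment", "User  id", "Photo id", "created Timestamp",
                    "posted date", "emoji used", "Hashtags used count"]

comments_spec = {
    "drop": ["posted date", "emoji used", "Hashtags used count"],
    "rename": {"comment": "comment_text", "User  id": "user_id",
               "Photo id": "photo_id", "created Timestamp": "created_at"},
}

# Hypothetical spec with a typo ("User id" with a single space) to show the check firing
typo_spec = {"drop": [], "rename": {"User id": "user_id"}}

ok = missing_columns(comments_columns, comments_spec)
bad = missing_columns(comments_columns, typo_spec)
print(ok)
print(bad)
```

With the notebook's own objects, the same check could be called as `missing_columns(data[key].columns, names_to_drop[key])` for each `key` before anything is dropped or renamed.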

In [5]:
#checking one dataset for correctness
data['comments']

        id            comment_text  user_id  photo_id        created_at
0        1         unde at dolorem        2         1  13-04-2023 08:04
1        2         quae ea ducimus        3         1  13-04-2023 08:04
2        3      alias a voluptatum        5         1  13-04-2023 08:04
3        4    facere suscipit sunt       14         1  13-04-2023 08:04
4        5  totam eligendi quaerat       17         1  13-04-2023 08:04
...    ...                     ...      ...       ...               ...
7483  7484       accusamus vel est       82       257  13-04-2023 08:04
7484  7485           sit nulla qui       91       257  13-04-2023 08:04
7485  7486        sed quidem vitae       93       257  13-04-2023 08:04
7486  7487   dolorem eveniet rerum       95       257  13-04-2023 08:04
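
The `created_at` values are kept as text in day-month-year order (`13-04-2023 08:04`). Text in that order does not sort chronologically in SQL, so converting it to an ISO-style string before loading may be worthwhile. This step is not part of the original task list; the snippet is only a sketch using the standard library, and `to_iso` and `SOURCE_FORMAT` are hypothetical names.

```python
from datetime import datetime

SOURCE_FORMAT = "%d-%m-%Y %H:%M"  # day-month-year, as in "13-04-2023 08:04"

def to_iso(timestamp_text):
    """Convert a 'DD-MM-YYYY HH:MM' string to 'YYYY-MM-DD HH:MM:SS'."""
    parsed = datetime.strptime(timestamp_text, SOURCE_FORMAT)
    return parsed.strftime("%Y-%m-%d %H:%M:%S")

iso_value = to_iso("13-04-2023 08:04")
print(iso_value)
```

For a whole column in the notebook, `pd.to_datetime(data['comments']['created_at'], format="%d-%m-%Y %H:%M")` should give the equivalent result as a datetime Series.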


In [6]:
#export the cleaned datasets to new CSV files

for key in data.keys():
    data[key].to_csv('original-cleaned/{}_cleaned.csv'.format(key), index=False)

The cleaned datasets were uploaded into a SQL database. For further analysis, please refer to the 'Analysing_User_Behaviour_on_Instagram.sql' file.
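
The upload step itself is not shown in this notebook, and the database engine used in the project is not stated. As a rough sketch only, assuming SQLite, a cleaned CSV could be loaded as follows. The in-memory CSV text stands in for `original-cleaned/comments_cleaned.csv`, and the table definition is an assumption based on the cleaned column names.

```python
import csv
import io
import sqlite3

# Stand-in for original-cleaned/comments_cleaned.csv (rows copied from the output above)
csv_text = """id,comment_text,user_id,photo_id,created_at
1,unde at dolorem,2,1,13-04-2023 08:04
2,quae ea ducimus,3,1,13-04-2023 08:04
3,alias a voluptatum,5,1,13-04-2023 08:04
"""

# Assumption: SQLite, in memory; the project's actual database is not stated
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments ("
    "id INTEGER PRIMARY KEY, comment_text TEXT, "
    "user_id INTEGER, photo_id INTEGER, created_at TEXT)"
)

# Read the CSV rows and convert the numeric fields before inserting
reader = csv.DictReader(io.StringIO(csv_text))
rows = [
    (int(r["id"]), r["comment_text"], int(r["user_id"]), int(r["photo_id"]), r["created_at"])
    for r in reader
]
conn.executemany("INSERT INTO comments VALUES (?, ?, ?, ?, ?)", rows)
conn.commit()

row_count = conn.execute("SELECT COUNT(*) FROM comments").fetchone()[0]
first_user = conn.execute("SELECT user_id FROM comments WHERE id = 1").fetchone()[0]
print(row_count, first_user)
```

To load a real file, the `io.StringIO(csv_text)` object could be replaced with a file opened via `open(path, newline="")`.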