# Data Cleaning

- The first step is to choose a dataset and clean it, so that it can then be enriched and analyzed.
- For this project I chose the "most followed accounts on Instagram" dataset from Kaggle.

## 1. Loading the data

#### Let's import the libraries first

In [1]:
import pandas as pd 
import numpy as np
import re
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

#### And now, let's load the data

In [2]:
df = pd.read_csv('most_followed_ig.csv', encoding = 'Windows-1252')
df

| | RANK | BRAND | CATEGORIES 1 | CATEGORIES 2 | FOLLOWERS | ER | iPOSTS ON HASHTAG | MEDIA POSTED |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Selena Gomez | celebrities | musicians | 105.4Mæ(=) | 2.62%æ(1342) | 14.5Mæ(48) | 1.2kæ(2135) |
| 1 | 2 | Taylor Swift | celebrities | musicians | 95.2Mæ(=) | 1.96%æ(2040) | 10.5Mæ(66) | 958æ(2669) |
| 2 | 3 | Ariana Grande | celebrities | musicians | 92.3Mæ(=) | 1.43%æ(2759) | 16.9Mæ(41) | 2.8kæ(824) |
| 3 | 4 | Beyonce | celebrities | musicians | 90.6Mæ(=) | 2.53%æ(1427) | 9.2Mæ(70) | 1.4kæ(1897) |
| 4 | 5 | Kim Kardashian West | celebrities | tv | 89.3Mæ(=) | 1.39%æ(2812) | 5.1Mæ(130) | 3.6kæ(550) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 96 | DanialvesD2 My Twitter | celebrities | athletes | 11.7Mæ(=) | 1.62%æ(2477) | 122.4kæ(1486) | 1.7kæ(1508) |
| 96 | 97 | Dolce & Gabbana | fashion | luxury | 11.7Mæ(=) | 0.48%æ(4142) | 6.1Mæ(105) | 3.9kæ(471) |
| 97 | 98 | Tyga / T-Raww | celebrities | musicians | 11.6Mæ(=) | 1.31%æ(2922) | 1.2Mæ(421) | 2.5kæ(948) |
| 98 | 99 | Paul Labile Pogba | celebrities | athletes | 11.5Mæ(=) | 6.11%æ(170) | 77.6kæ(1745) | 396æ(4219) |
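
To illustrate what the `encoding` argument does when loading, here is a minimal sketch that reads a tiny in-memory CSV instead of the real file. The two rows and the `raw` and `demo` names are made up for the example. It also relies on Python treating `"cp1252"` as an alias for `"Windows-1252"`.

```python
import io

import pandas as pd

# Hypothetical two-row CSV, encoded as Windows-1252 (cp1252) bytes.
raw = "RANK,BRAND,FOLLOWERS\n1,Beyoncé,90.6M\n2,Café Example,1.2M\n".encode("cp1252")

# Passing the matching encoding lets pandas decode the accented characters.
demo = pd.read_csv(io.BytesIO(raw), encoding="cp1252")
print(demo)
```

If a file was saved with a different encoding than the one passed here, accented or special characters may come out garbled, so it is worth checking a few rows right after loading.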


## 2. Exploratory Analysis

In [3]:
df.shape

(100, 8)

In [4]:
df.dtypes

RANK                  int64
BRAND                object
CATEGORIES 1         object
CATEGORIES 2         object
FOLLOWERS            object
ER                   object
iPOSTS ON HASHTAG    object
MEDIA POSTED         object
dtype: object

In [5]:
df.columns.values

array(['RANK', 'BRAND', 'CATEGORIES 1', 'CATEGORIES 2', 'FOLLOWERS', 'ER',
       'iPOSTS ON HASHTAG', 'MEDIA POSTED'], dtype=object)

### Renaming columns

In [6]:
df.rename(columns = {'CATEGORIES 1':'CATEGORIES', 'CATEGORIES 2':'SUBCATEGORIES'}, inplace = True)

In [7]:
df

| | RANK | BRAND | CATEGORIES | SUBCATEGORIES | FOLLOWERS | ER | iPOSTS ON HASHTAG | MEDIA POSTED |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Selena Gomez | celebrities | musicians | 105.4Mæ(=) | 2.62%æ(1342) | 14.5Mæ(48) | 1.2kæ(2135) |
| 1 | 2 | Taylor Swift | celebrities | musicians | 95.2Mæ(=) | 1.96%æ(2040) | 10.5Mæ(66) | 958æ(2669) |
| 2 | 3 | Ariana Grande | celebrities | musicians | 92.3Mæ(=) | 1.43%æ(2759) | 16.9Mæ(41) | 2.8kæ(824) |
| 3 | 4 | Beyonce | celebrities | musicians | 90.6Mæ(=) | 2.53%æ(1427) | 9.2Mæ(70) | 1.4kæ(1897) |
| 4 | 5 | Kim Kardashian West | celebrities | tv | 89.3Mæ(=) | 1.39%æ(2812) | 5.1Mæ(130) | 3.6kæ(550) |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 96 | DanialvesD2 My Twitter | celebrities | athletes | 11.7Mæ(=) | 1.62%æ(2477) | 122.4kæ(1486) | 1.7kæ(1508) |
| 96 | 97 | Dolce & Gabbana | fashion | luxury | 11.7Mæ(=) | 0.48%æ(4142) | 6.1Mæ(105) | 3.9kæ(471) |
| 97 | 98 | Tyga / T-Raww | celebrities | musicians | 11.6Mæ(=) | 1.31%æ(2922) | 1.2Mæ(421) | 2.5kæ(948) |
| 98 | 99 | Paul Labile Pogba | celebrities | athletes | 11.5Mæ(=) | 6.11%æ(170) | 77.6kæ(1745) | 396æ(4219) |


### Removing special characters

In [15]:
df.replace('æ','',regex=True, inplace=True)

In [18]:
# Raw string so that \( and \) reach the regex engine without invalid-escape warnings
df.replace(r'\(.*\)', '', regex=True, inplace=True)

In [19]:
df

| | RANK | BRAND | CATEGORIES | SUBCATEGORIES | FOLLOWERS | ER | iPOSTS ON HASHTAG | MEDIA POSTED |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Selena Gomez | celebrities | musicians | 105.4M | 2.62% | 14.5M | 1.2k |
| 1 | 2 | Taylor Swift | celebrities | musicians | 95.2M | 1.96% | 10.5M | 958 |
| 2 | 3 | Ariana Grande | celebrities | musicians | 92.3M | 1.43% | 16.9M | 2.8k |
| 3 | 4 | Beyonce | celebrities | musicians | 90.6M | 2.53% | 9.2M | 1.4k |
| 4 | 5 | Kim Kardashian West | celebrities | tv | 89.3M | 1.39% | 5.1M | 3.6k |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | 96 | DanialvesD2 My Twitter | celebrities | athletes | 11.7M | 1.62% | 122.4k | 1.7k |
| 96 | 97 | Dolce & Gabbana | fashion | luxury | 11.7M | 0.48% | 6.1M | 3.9k |
| 97 | 98 | Tyga / T-Raww | celebrities | musicians | 11.6M | 1.31% | 1.2M | 2.5k |
| 98 | 99 | Paul Labile Pogba | celebrities | athletes | 11.5M | 6.11% | 77.6k | 396 |
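
The cleanup above can also be tried on a small sample before touching the full DataFrame. The sketch below is a variant, not the exact cells above: it merges both patterns into one regex and applies it only to the metric columns. The `sample`, `metric_cols` and `cleaned` names, and the brand with parentheses in its name, are hypothetical and exist only for this example.

```python
import pandas as pd

# Small sample: two raw rows from the table above plus one hypothetical brand
# whose name contains parentheses.
sample = pd.DataFrame({
    "BRAND": ["Selena Gomez", "Taylor Swift", "Example Brand (official)"],
    "FOLLOWERS": ["105.4Mæ(=)", "95.2Mæ(=)", "11.5Mæ(=)"],
    "MEDIA POSTED": ["1.2kæ(2135)", "958æ(2669)", "396æ(4219)"],
})

# Only the metric columns carry the 'æ(...)' suffix, so limit the cleanup to them.
metric_cols = ["FOLLOWERS", "MEDIA POSTED"]

cleaned = sample.copy()
# One pattern: the stray 'æ' followed by a parenthesised group at the end of the value.
cleaned[metric_cols] = cleaned[metric_cols].replace(r"æ\(.*\)$", "", regex=True)
print(cleaned)
```

Limiting the replacement to the metric columns is a design choice for the sketch: a pattern that matches any parenthesised text would also strip it from a brand name if applied to the whole DataFrame.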


### Null values

In [8]:
df.isnull().values.any()

True

In [9]:
df.isnull().sum().sort_values(ascending=False)

SUBCATEGORIES        9
MEDIA POSTED         0
iPOSTS ON HASHTAG    0
ER                   0
FOLLOWERS            0
CATEGORIES           0
BRAND                0
RANK                 0
dtype: int64
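
Once the counts show which column has missing values, a natural follow-up is to look at the affected rows. Here is a minimal sketch on a made-up three-row frame. The account names and the `sample`, `null_counts` and `missing_rows` names are hypothetical, and the real rows with missing `SUBCATEGORIES` are not shown here.

```python
import pandas as pd

# Hypothetical sample with two missing subcategories.
sample = pd.DataFrame({
    "BRAND": ["Account A", "Account B", "Account C"],
    "CATEGORIES": ["celebrities", "fashion", "celebrities"],
    "SUBCATEGORIES": ["musicians", None, None],
})

# Count the nulls per column, as in the cell above.
null_counts = sample.isnull().sum()

# Boolean mask selecting only the rows where SUBCATEGORIES is missing.
missing_rows = sample[sample["SUBCATEGORIES"].isnull()]

print(null_counts)
print(missing_rows[["BRAND", "CATEGORIES"]])
```

Seeing the rows themselves, rather than only the count, can help decide whether the gaps should be filled, left as they are, or dropped.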