# **DATA EXPLORATION & CLEANING.**

## **Libraries.**

In [42]:
import numpy as np
import pandas as pd
import re

## **General view of data.**

### **Import datasets.**

In [43]:
actors = pd.read_csv('../data/actor.csv')
categories = pd.read_csv('../data/category.csv')
films = pd.read_csv('../data/film.csv')
inventories = pd.read_csv('../data/inventory.csv')
languages = pd.read_csv('../data/language.csv')
hdd = pd.read_csv('../data/old_HDD.csv')
rentals = pd.read_csv('../data/rental.csv')

### **Dimensions.**

In [44]:
print('Dimensions of actors DataFrame: ', actors.shape)
print('Dimensions of categories DataFrame: ', categories.shape)
print('Dimensions of films DataFrame: ', films.shape)
print('Dimensions of inventories DataFrame: ', inventories.shape)
print('Dimensions of languages DataFrame: ', languages.shape)
print('Dimensions of old HDD DataFrame: ', hdd.shape)
print('Dimensions of rentals DataFrame: ', rentals.shape)

Dimensions of actors DataFrame:  (200, 4)
Dimensions of categories DataFrame:  (16, 3)
Dimensions of films DataFrame:  (1000, 13)
Dimensions of inventories DataFrame:  (1000, 4)
Dimensions of languages DataFrame:  (6, 3)
Dimensions of old HDD DataFrame:  (1000, 5)
Dimensions of rentals DataFrame:  (1000, 7)
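
The shape check above repeats one `print` call per table. As a minimal sketch, the same summary can be built in one pass by holding the DataFrames in a dictionary. The `frames` dictionary and its toy contents are hypothetical stand-ins, not the project's CSV files.

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical data, not the project's CSVs).
frames = {
    'actors': pd.DataFrame({'actor_id': [1, 2],
                            'first_name': ['PENELOPE', 'NICK']}),
    'categories': pd.DataFrame({'category_id': [1, 2, 3],
                                'name': ['Action', 'Animation', 'Children']}),
}

# Build one summary table: one row per DataFrame, columns for its dimensions.
shapes = pd.DataFrame.from_dict(
    {name: df.shape for name, df in frames.items()},
    orient='index',
    columns=['rows', 'columns'],
)
print(shapes)
```

With the real data, the dictionary would map each name to the DataFrame loaded earlier (for example `'actors': actors`), which keeps the summary in step with the list of tables.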


**Let's look at each dataset individually.**

### **Actors.**

In [45]:
actors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   actor_id     200 non-null    int64 
 1   first_name   200 non-null    object
 2   last_name    200 non-null    object
 3   last_update  200 non-null    object
dtypes: int64(1), object(3)
memory usage: 6.4+ KB


**There are no null values in the entire table.**
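
Reading the non-null counts from `info()` works for small tables, but the same conclusion can also be checked programmatically. The sketch below uses a hypothetical toy DataFrame (`toy`) that deliberately contains one missing value, so the counts are not all zero.

```python
import pandas as pd

# Hypothetical toy table with one missing first name.
toy = pd.DataFrame({
    'actor_id': [1, 2, 3],
    'first_name': ['PENELOPE', None, 'ED'],
})

# Number of missing values per column.
null_counts = toy.isna().sum()

# Single flag: True if any cell in the table is missing.
has_nulls = bool(toy.isna().any().any())

print(null_counts)
print('Any nulls:', has_nulls)
```

Applied to a table with no missing values, such as `actors` here, `null_counts` would be zero for every column.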

In [46]:
actors

Unnamed: 0,actor_id,first_name,last_name,last_update
0,1,PENELOPE,GUINESS,2006-02-15 04:34:33
1,2,NICK,WAHLBERG,2006-02-15 04:34:33
2,3,ED,CHASE,2006-02-15 04:34:33
3,4,JENNIFER,DAVIS,2006-02-15 04:34:33
4,5,JOHNNY,LOLLOBRIGIDA,2006-02-15 04:34:33
...,...,...,...,...
195,196,BELA,WALKEN,2006-02-15 04:34:33
196,197,REESE,WEST,2006-02-15 04:34:33
197,198,MARY,KEITEL,2006-02-15 04:34:33
198,199,JULIA,FAWCETT,2006-02-15 04:34:33


**The "last_update" column appears to hold the same value in every row, so it would carry no useful information. Let's verify this.**

In [47]:
actors['last_update'].unique()

array(['2006-02-15 04:34:33'], dtype=object)

In [48]:
# "last_update" holds a single constant value, so drop it.
actors = actors.drop(columns='last_update')
actors

Unnamed: 0,actor_id,first_name,last_name
0,1,PENELOPE,GUINESS
1,2,NICK,WAHLBERG
2,3,ED,CHASE
3,4,JENNIFER,DAVIS
4,5,JOHNNY,LOLLOBRIGIDA
...,...,...,...
195,196,BELA,WALKEN
196,197,REESE,WEST
197,198,MARY,KEITEL
198,199,JULIA,FAWCETT
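
The check-then-drop step above can be generalized to any table. The following is a rough sketch, assuming a helper named `constant_columns` (hypothetical, not part of pandas or of this notebook) that reports every column holding at most one distinct value. The toy DataFrame is illustrative only.

```python
import pandas as pd

def constant_columns(df):
    """Return the names of columns that hold at most one distinct value."""
    # dropna=False counts NaN as a value, so an all-NaN column is also flagged.
    counts = df.nunique(dropna=False)
    return counts[counts <= 1].index.tolist()

# Hypothetical toy table mirroring the structure of the actors data.
toy = pd.DataFrame({
    'actor_id': [1, 2, 3],
    'first_name': ['PENELOPE', 'NICK', 'ED'],
    'last_update': ['2006-02-15 04:34:33'] * 3,
})

constant = constant_columns(toy)
trimmed = toy.drop(columns=constant)

print(constant)
print(trimmed.columns.tolist())
```

Returning the column names instead of dropping them inside the helper leaves the decision to the caller, since a constant column may still matter in some tables.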


### **Categories.**

In [49]:
categories.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16 entries, 0 to 15
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   category_id  16 non-null     int64 
 1   name         16 non-null     object
 2   last_update  16 non-null     object
dtypes: int64(1), object(2)
memory usage: 512.0+ bytes


**There are no null values in the entire table.**

In [50]:
categories

Unnamed: 0,category_id,name,last_update
0,1,Action,2006-02-15 04:46:27
1,2,Animation,2006-02-15 04:46:27
2,3,Children,2006-02-15 04:46:27
3,4,Classics,2006-02-15 04:46:27
4,5,Comedy,2006-02-15 04:46:27
5,6,Documentary,2006-02-15 04:46:27
6,7,Drama,2006-02-15 04:46:27
7,8,Family,2006-02-15 04:46:27
8,9,Foreign,2006-02-15 04:46:27
9,10,Games,2006-02-15 04:46:27
