# Cleaning and Structuring the Data

## Introduction

Your manager is impressed with your progress but points out that the data is messy. Before we can analyze it effectively, we need to clean and structure the data properly.

Your task is to:

- Handle missing values
- Remove duplicate or inconsistent data
- Standardize the data format

Let's get started!


## Task 1: Identify Issues in the Data

Your manager provides you with a sample dataset in which some records are incomplete or incorrect:

```python
{
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},
        {"id": 5, "name": "Amit", "friends": [], "liked_pages": []}
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"},
        {"id": 104, "name": "Web Development"}
    ]
}
```

## Problems

- User ID 3 has an empty name.
- User ID 4 has a duplicate friend entry (ID 2 appears twice).
- User ID 5 has no friends or liked pages (inactive user).
- Page ID 104 appears twice, with two different names.
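
These problems can also be found programmatically instead of by eye. The sketch below is one possible way to do it: `find_issues` and the keys of the dictionary it returns are hypothetical names chosen for illustration, and `sample` is simply the example dataset from above, repeated so the block runs on its own.

```python
from collections import Counter


def find_issues(data):
    """Return the IDs of the records affected by each kind of problem."""
    users = data["users"]
    page_id_counts = Counter(page["id"] for page in data["pages"])
    return {
        # Names that are empty or contain only whitespace
        "empty_name": [u["id"] for u in users if not u["name"].strip()],
        # Friend lists in which at least one ID is repeated
        "duplicate_friends": [
            u["id"] for u in users if len(u["friends"]) != len(set(u["friends"]))
        ],
        # Users with no friends and no liked pages
        "inactive": [
            u["id"] for u in users if not u["friends"] and not u["liked_pages"]
        ],
        # Page IDs that occur more than once
        "duplicate_page_ids": [pid for pid, n in page_id_counts.items() if n > 1],
    }


# The example dataset shown above
sample = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},
        {"id": 5, "name": "Amit", "friends": [], "liked_pages": []},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"},
        {"id": 104, "name": "Web Development"},
    ],
}

issues = find_issues(sample)
for problem, ids in issues.items():
    print(f"{problem}: {ids}")
```

On the sample data this reports user 3 for the empty name, user 4 for the duplicate friend, user 5 as inactive, and page 104 as a duplicate ID.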

The `clean_data` function below handles each of the problems listed above:

```python
import json


def clean_data(data):
    """Clean the users and pages in place and return the same dict."""
    # Remove users with missing (empty or whitespace-only) names
    data["users"] = [user for user in data["users"] if user["name"].strip()]

    # Remove duplicate friends; dict.fromkeys keeps the original order
    for user in data["users"]:
        user["friends"] = list(dict.fromkeys(user["friends"]))

    # Remove inactive users (no friends and no liked pages)
    data["users"] = [user for user in data["users"] if user["friends"] or user["liked_pages"]]

    # Remove duplicate pages; when an ID repeats, the last entry wins
    unique_pages = {}
    for page in data["pages"]:
        unique_pages[page["id"]] = page
    data["pages"] = list(unique_pages.values())

    return data


# Load the raw data, clean it, and save the result
with open("data2.json") as f:
    data = json.load(f)

data = clean_data(data)

with open("cleaned_codebook_data.json", "w") as f:
    json.dump(data, f, indent=4)

print("Data cleaned successfully!")
```

Running this prints:

```text
Data cleaned successfully!
```
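
One inconsistency can survive this cleaning step: after `clean_data` removes users 3 and 5, user 1's friend list still contains ID 3, which no longer points at any user. The sketch below shows one possible follow-up pass. It is not part of the original task: `drop_dangling_references` is a hypothetical helper, and `cleaned` is written out by hand to match what `clean_data` would produce for the sample dataset.

```python
def drop_dangling_references(data):
    """Remove friend IDs and liked page IDs that no longer match a record."""
    user_ids = {user["id"] for user in data["users"]}
    page_ids = {page["id"] for page in data["pages"]}
    for user in data["users"]:
        # Keep only references to users and pages that still exist
        user["friends"] = [f for f in user["friends"] if f in user_ids]
        user["liked_pages"] = [p for p in user["liked_pages"] if p in page_ids]
    return data


# Hand-written stand-in for the sample dataset after clean_data has run
cleaned = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Development"},
    ],
}

cleaned = drop_dangling_references(cleaned)
print(cleaned["users"][0]["friends"])
```

Note that pruning references can leave a user with no friends and no liked pages, so the inactive-user check may need to run again afterwards.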
