# Video 2: Cleaning and Structuring the Data
## **Introduction** 

Your manager is impressed with your progress but points out that the data is messy. Before we can analyze it effectively, we need to **clean and structure the data** properly.

Your task is to:
- Handle missing values
- Remove duplicate or inconsistent data
- Standardize the data format

Let's get started!

-----

## **Task 1: Identify Issues in the Data**
Your manager provides you with an example dataset where some records are incomplete or incorrect. Here's an example:

```json
{
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},
        {"id": 5, "name": "Amit", "friends": [], "liked_pages": []}
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"},
        {"id": 104, "name": "Web Development"}
    ]
}
```

**Problems:**
1. User **ID 3** has an empty name.
2. User **ID 4** has a duplicate friend entry.
3. User **ID 5** has no connections or liked pages (inactive user).
4. The **pages list** contains duplicate page IDs (ID 104 appears twice, with different names).
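
These four problems can also be found programmatically rather than by eye. The following is a minimal sketch, not part of the original notebook: `find_issues` is a hypothetical helper name, and `sample` is simply the example dataset above copied inline so the snippet runs on its own.

```python
from collections import Counter

# The example dataset from above, copied inline so the snippet is self-contained
sample = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},
        {"id": 5, "name": "Amit", "friends": [], "liked_pages": []},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"},
        {"id": 104, "name": "Web Development"},
    ],
}


def find_issues(data):
    """Return the IDs affected by each problem (hypothetical helper)."""
    users = data["users"]
    # Count how many times each page ID occurs
    page_counts = Counter(page["id"] for page in data["pages"])
    return {
        # Users whose name is empty or only whitespace
        "missing_name": [u["id"] for u in users if not u["name"].strip()],
        # Users whose friends list contains a repeated ID
        "duplicate_friends": [
            u["id"] for u in users if len(u["friends"]) != len(set(u["friends"]))
        ],
        # Users with no friends and no liked pages
        "inactive": [
            u["id"] for u in users if not u["friends"] and not u["liked_pages"]
        ],
        # Page IDs that occur more than once
        "duplicate_page_ids": [pid for pid, n in page_counts.items() if n > 1],
    }


issues = find_issues(sample)
print(issues)
# {'missing_name': [3], 'duplicate_friends': [4], 'inactive': [5], 'duplicate_page_ids': [104]}
```

Each key in the result lines up with one entry in the problem list, which makes it easy to rerun the check after cleaning.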

---

## **Task 2: Clean the Data**
We will:
1. Remove users with missing names.
2. Remove duplicate friend entries.
3. Remove inactive users (users with no friends and no liked pages).
4. Deduplicate pages based on IDs.

### **Code Implementation**
```python
import json


def clean_data(data):
    """Clean the users and pages in place and return the same dict."""
    # Remove users whose name is empty or only whitespace
    data['users'] = [user for user in data['users'] if user['name'].strip()]

    # Remove duplicate friend IDs while keeping their original order
    for user in data['users']:
        user['friends'] = list(dict.fromkeys(user['friends']))

    # Remove inactive users (no friends and no liked pages)
    data['users'] = [
        user for user in data['users']
        if user['friends'] or user['liked_pages']
    ]

    # Deduplicate pages by ID; if an ID repeats, the last entry wins
    unique_pages = {}
    for page in data['pages']:
        unique_pages[page['id']] = page
    data['pages'] = list(unique_pages.values())

    return data


# Load the raw data
with open("data2.json") as f:
    data = json.load(f)

data = clean_data(data)

# Save the cleaned data
with open("cleaned_data2.json", 'w') as f:
    json.dump(data, f, indent=4)

print("Data has been cleaned successfully")
```

Output:

```
Data has been cleaned successfully
```


## **Expected Output:**
After cleaning, the dataset:
- Contains no users with missing names
- Has friend lists with unique entries only
- Contains no inactive users
- Has one entry per page ID
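
One way to confirm these guarantees is a small set of assertions. This is a sketch under assumptions rather than part of the original notebook: `verify_cleaned` is a hypothetical helper, and the `cleaned` dictionary is what the Task 1 sample would look like after cleaning, assuming `data2.json` holds that sample.

```python
def verify_cleaned(data):
    """Raise AssertionError if any of the four guarantees is violated."""
    for user in data["users"]:
        uid = user["id"]
        assert user["name"].strip(), f"user {uid} has no name"
        assert len(user["friends"]) == len(set(user["friends"])), (
            f"user {uid} has duplicate friends"
        )
        assert user["friends"] or user["liked_pages"], f"user {uid} is inactive"

    page_ids = [page["id"] for page in data["pages"]]
    assert len(page_ids) == len(set(page_ids)), "duplicate page IDs remain"
    return True


# Assumed result of cleaning the Task 1 sample
cleaned = {
    "users": [
        # friend 3 is kept here even though user 3 was removed
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        # the last entry for ID 104 is the one that survives
        {"id": 104, "name": "Web Development"},
    ],
}

result = verify_cleaned(cleaned)
print(result)  # True
```

To check the real output instead, load `cleaned_data2.json` with `json.load` and pass the result to `verify_cleaned`.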


---

## **Next Steps**
Your manager is happy with the cleaned data and says: **"Great! Now that our data is structured, let's start analyzing it. First, let's build a 'People You May Know' feature!"**
 