# Cleaning and Structuring the Data


## The task is to:

- Handle missing values
- Remove duplicate or inconsistent data
- Standardize the data format

### Task 1: Identify Issues in the Data

In [7]:
{
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},
        {"id": 5, "name": "Amit", "friends": [], "liked_pages": []}
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"},
        {"id": 104, "name": "Web Development"}
    ]
}

{'users': [{'id': 1, 'name': 'Amit', 'friends': [2, 3], 'liked_pages': [101]},
  {'id': 2, 'name': 'Priya', 'friends': [1, 4], 'liked_pages': [102]},
  {'id': 3, 'name': '', 'friends': [1], 'liked_pages': [101, 103]},
  {'id': 4, 'name': 'Sara', 'friends': [2, 2], 'liked_pages': [104]},
  {'id': 5, 'name': 'Amit', 'friends': [], 'liked_pages': []}],
 'pages': [{'id': 101, 'name': 'Python Developers'},
  {'id': 102, 'name': 'Data Science Enthusiasts'},
  {'id': 103, 'name': 'AI & ML Community'},
  {'id': 104, 'name': 'Web Dev Hub'},
  {'id': 104, 'name': 'Web Development'}]}

# Let's Begin


## Task One:


### Problems:

- User ID 3 has an empty name.
- User ID 4 has a duplicate friend entry.
- User ID 5 has no connections or liked pages (inactive user).
- The pages list contains a duplicate page ID (104 appears twice, with two different names).
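The same problems can be found programmatically rather than by eye. The following is a minimal sketch, assuming the data has the same structure as the sample above; `find_issues` is a hypothetical helper name, not something from the original notebook.

```python
from collections import Counter


def find_issues(data):
    """Return a list of human-readable descriptions of data-quality problems."""
    issues = []

    for user in data["users"]:
        # Empty or whitespace-only name
        if not user["name"].strip():
            issues.append(f"User ID {user['id']} has an empty name.")
        # A set drops repeats, so a length mismatch means duplicates exist
        if len(user["friends"]) != len(set(user["friends"])):
            issues.append(f"User ID {user['id']} has a duplicate friend entry.")
        # No friends and no liked pages
        if not user["friends"] and not user["liked_pages"]:
            issues.append(f"User ID {user['id']} is inactive.")

    # Count how often each page ID occurs
    page_counts = Counter(page["id"] for page in data["pages"])
    for page_id, count in page_counts.items():
        if count > 1:
            issues.append(f"Page ID {page_id} appears {count} times.")

    return issues


# The sample data from the cell above
sample = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},
        {"id": 5, "name": "Amit", "friends": [], "liked_pages": []},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"},
        {"id": 104, "name": "Web Development"},
    ],
}

for issue in find_issues(sample):
    print(issue)
```

On the sample data this reports one line per problem listed above.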

## Task 2: Clean the Data

### What I'll do:

- Remove users with missing names.
- Remove duplicate friend entries.
- Remove inactive users (users with no friends and no liked pages).
- Remove duplicate pages (entries that share a page ID).


### Code implementation:

In [8]:
import json

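def clean_data(data):
    """Clean the users and pages in `data` in place and return it."""

    # Remove users with missing (empty or whitespace-only) names
    data["users"] = [user for user in data["users"] if user["name"].strip()]

    # Remove duplicate friends, keeping the first occurrence of each ID
    # (dict.fromkeys preserves order, unlike set)
    for user in data["users"]:
        user["friends"] = list(dict.fromkeys(user["friends"]))

    # Remove inactive users (no friends and no liked pages)
    data["users"] = [
        user for user in data["users"] if user["friends"] or user["liked_pages"]
    ]

    # Remove duplicate pages; when a page ID repeats, the last entry wins
    unique_pages = {}
    for page in data["pages"]:
        unique_pages[page["id"]] = page
    data["pages"] = list(unique_pages.values())

    return data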
        
                     
# Load the data, clean it, and write the result to a new file
with open("data_2.json") as f:
    data = json.load(f)

data = clean_data(data)

with open("Cleaned_data2.json", "w") as f:
    json.dump(data, f, indent=4)

print("Data has been cleaned")

Data has been cleaned
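A quick sanity check can confirm that none of the four problems from Task 1 remain. The sketch below is illustrative only: `check_cleaned` is a hypothetical helper, and `cleaned_sample` is the result I would expect for the sample data (users 3 and 5 removed, user 4's friends deduplicated, one entry kept for page 104), written out by hand rather than read from `Cleaned_data2.json`.

```python
def check_cleaned(data):
    """Return True if none of the four problems from Task 1 remain."""
    users_ok = all(
        user["name"].strip()                                    # name is present
        and len(user["friends"]) == len(set(user["friends"]))   # no duplicate friends
        and (user["friends"] or user["liked_pages"])            # user is active
        for user in data["users"]
    )
    page_ids = [page["id"] for page in data["pages"]]
    pages_ok = len(page_ids) == len(set(page_ids))              # page IDs are unique
    return users_ok and pages_ok


# Expected cleaned form of the sample data (assumption, written by hand)
cleaned_sample = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Development"},
    ],
}

print(check_cleaned(cleaned_sample))
```

In the notebook itself, the same function could be called on the `data` returned by `clean_data`.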
