### Welcome to CodeBook – Your Data Science Internship Begins!

#### Introduction

Congratulations! You have just been hired as a Data Scientist Intern at CodeBook – The Social Media for Coders.
This Delhi-based company is offering you a ₹10 LPA job if you successfully complete this 1-month internship.
But before you get there, you must prove your skills using only Python: no pandas, NumPy, or other fancy libraries!

Your manager Puneet Kumar has assigned you your first task: analyzing a data dump of CodeBook users using pure Python.
Your job is to load and explore the data to understand its structure.


#### Task 1: Load the User Data
Your manager has given you a dataset containing information about CodeBook users, their connections (friends), and the pages they have liked.

This is what the data looks like (in JSON format):

{
  "users": [
    {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
    {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
    {"id": 3, "name": "Rahul", "friends": [1], "liked_pages": [101, 103]},
    {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]}
  ],
  "pages": [
    {"id": 101, "name": "Python Developers"},
    {"id": 102, "name": "Data Science Enthusiasts"},
    {"id": 103, "name": "AI & ML Community"},
    {"id": 104, "name": "Web Dev Hub"}
  ]
}

#### Explanation:

Read this data and understand its structure. The data contains three main components:
1. Users: Each user has an ID, name, a list of friends (by their IDs), and a list of liked pages (by their IDs).

2. Pages: Each page has an ID and a name.

3. Connections: Users can have multiple friends and can like multiple pages.
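
Because friends and liked pages are stored as IDs, reading a record usually means translating those IDs back into names. The following is a minimal sketch of that lookup using the sample data above; `user_names`, `page_names`, and `describe_user` are hypothetical names introduced only for illustration.

```python
# Sample data copied from the JSON shown above.
data = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "Rahul", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"},
    ],
}

# Hypothetical lookup tables mapping an ID to its name.
user_names = {user["id"]: user["name"] for user in data["users"]}
page_names = {page["id"]: page["name"] for page in data["pages"]}


def describe_user(user_id):
    """Return (friend names, liked page names) for one user ID."""
    user = next(u for u in data["users"] if u["id"] == user_id)
    friends = [user_names[f] for f in user["friends"]]
    pages = [page_names[p] for p in user["liked_pages"]]
    return friends, pages


print(describe_user(3))
# (['Amit'], ['Python Developers', 'AI & ML Community'])
```

Building the two dictionaries once avoids scanning the full lists every time an ID needs to be resolved.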

In [29]:
import json

# Load the JSON file and return its contents as a Python dict

def load_data(filename):
    with open(filename, 'r', encoding='utf-8') as file:
        return json.load(file)

In [30]:
data = load_data('users_data.json')

data

{'users': [{'id': 1, 'name': 'Amit', 'friends': [2, 3], 'liked_pages': [101]},
  {'id': 2, 'name': 'Priya', 'friends': [1, 4], 'liked_pages': [102]},
  {'id': 3, 'name': 'Rahul', 'friends': [1], 'liked_pages': [101, 103]},
  {'id': 4, 'name': 'Sara', 'friends': [2], 'liked_pages': [104]}],
 'pages': [{'id': 101, 'name': 'Python Developers'},
  {'id': 102, 'name': 'Data Science Enthusiasts'},
  {'id': 103, 'name': 'AI & ML Community'},
  {'id': 104, 'name': 'Web Dev Hub'}]}

In [31]:
def display(data):
    print("User's Details:-")
    for user in data['users']:
        print(f"ID : {user['id']} - {user['name']} is friends with {user['friends']} and their liked pages are {user['liked_pages']}")

    print("\nPage's Details:-")
    for page in data['pages']:
        print(f"ID : {page['id']} - {page['name']}")

display(data)

User's Details:-
ID : 1 - Amit is friends with [2, 3] and their liked pages are [101]
ID : 2 - Priya is friends with [1, 4] and their liked pages are [102]
ID : 3 - Rahul is friends with [1] and their liked pages are [101, 103]
ID : 4 - Sara is friends with [2] and their liked pages are [104]

Page's Details:-
ID : 101 - Python Developers
ID : 102 - Data Science Enthusiasts
ID : 103 - AI & ML Community
ID : 104 - Web Dev Hub


## Cleaning and Structuring the Data

### Introduction

Your manager is impressed with your progress but points out that the data is messy.
Before we can analyze it effectively, we need to clean and structure the data properly.

Your task is to:
 * Handle missing values
 * Remove duplicate or inconsistent data
 * Standardize the data format

Let's get started!

#### Task 1: Identify Issues in the Data
Your manager provides you with an example dataset where some records are incomplete or incorrect.
Here’s an example:

{
  "users": [
    {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
    {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
    {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},
    {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},
    {"id": 5, "name": "Amit", "friends": [], "liked_pages": []}
  ],
  "pages": [
    {"id": 101, "name": "Python Developers"},
    {"id": 102, "name": "Data Science Enthusiasts"},
    {"id": 103, "name": "AI & ML Community"},
    {"id": 104, "name": "Web Dev Hub"},
    {"id": 104, "name": "Web Development"}
  ]
}

#### Problems:
* User ID 3 has an empty name.
* User ID 4 has a duplicate friend entry.
* User ID 5 has no connections or liked pages (inactive user).
* The pages list contains duplicate page IDs.
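
Before cleaning anything, it can help to detect these problems programmatically rather than by eye. The sketch below is one possible way to do that in pure Python; `find_issues` and `messy` are hypothetical names, and the use of `.get()` with fallbacks is an added assumption so that records with a missing key do not raise an error.

```python
from collections import Counter


def find_issues(data):
    """Return a list of human-readable problems found in the data."""
    issues = []
    for user in data["users"]:
        # Assumption: tolerate records where a key is missing or None.
        name = user.get("name") or ""
        friends = user.get("friends") or []
        liked = user.get("liked_pages") or []
        if not name.strip():
            issues.append(f"User ID {user['id']} has an empty name.")
        if len(friends) != len(set(friends)):
            issues.append(f"User ID {user['id']} has a duplicate friend entry.")
        if not friends and not liked:
            issues.append(f"User ID {user['id']} is inactive.")

    # Count how often each page ID occurs to spot duplicates.
    page_counts = Counter(page["id"] for page in data["pages"])
    for page_id, count in page_counts.items():
        if count > 1:
            issues.append(f"Page ID {page_id} appears {count} times.")
    return issues


# The example dataset from above.
messy = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2, 2], "liked_pages": [104]},
        {"id": 5, "name": "Amit", "friends": [], "liked_pages": []},
    ],
    "pages": [
        {"id": 101, "name": "Python Developers"},
        {"id": 102, "name": "Data Science Enthusiasts"},
        {"id": 103, "name": "AI & ML Community"},
        {"id": 104, "name": "Web Dev Hub"},
        {"id": 104, "name": "Web Development"},
    ],
}

for issue in find_issues(messy):
    print(issue)
# User ID 3 has an empty name.
# User ID 4 has a duplicate friend entry.
# User ID 5 is inactive.
# Page ID 104 appears 2 times.
```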

In [42]:
def clean_data(data):
    # Note: this modifies the input dict in place and also returns it.

    # Remove users with a missing (blank) name
    data['users'] = [user for user in data['users'] if user['name'].strip()]

    # Remove duplicate friend entries, keeping the original order
    for user in data['users']:
        user['friends'] = list(dict.fromkeys(user['friends']))

    # Remove inactive users (no friends and no liked pages)
    data['users'] = [user for user in data['users'] if user['friends'] or user['liked_pages']]

    # Remove duplicate pages; when an ID repeats, the last entry wins
    unique_pages = {}
    for page in data['pages']:
        unique_pages[page['id']] = page
    data['pages'] = list(unique_pages.values())
    return data



clean_data(data)

{'users': [{'id': 1, 'name': 'Amit', 'friends': [2, 3], 'liked_pages': [101]},
  {'id': 2, 'name': 'Priya', 'friends': [1, 4], 'liked_pages': [102]},
  {'id': 4, 'name': 'Sara', 'friends': [2], 'liked_pages': [104]}],
 'pages': [{'id': 101, 'name': 'Python Developers'},
  {'id': 102, 'name': 'Data Science Enthusiasts'},
  {'id': 103, 'name': 'AI & ML Community'},
  {'id': 104, 'name': 'Web Development'}]}
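
One thing worth noticing in the output above: Amit still lists user 3 as a friend, even though user 3 was removed for having an empty name. Whether to keep such references is a design decision the task does not specify. If they should be dropped, a sketch could look like the following, where `drop_dangling_friends` and `cleaned` are hypothetical names.

```python
def drop_dangling_friends(data):
    """Remove friend IDs that no longer match any remaining user."""
    valid_ids = {user["id"] for user in data["users"]}
    for user in data["users"]:
        # Keep the original order; drop IDs of users that were removed.
        user["friends"] = [f for f in user["friends"] if f in valid_ids]
    return data


# Users copied from the cleaned output above (pages omitted for brevity).
cleaned = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]},
    ],
    "pages": [],
}

drop_dangling_friends(cleaned)
print(cleaned["users"][0]["friends"])
# [2]
```

Like `clean_data`, this helper changes the dictionary in place, so it should run after the users list has reached its final form.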

In [39]:
# Use context managers so both files are closed properly
with open('data.json') as file:
    data = json.load(file)
data = clean_data(data)
with open('clean_data.json', 'w') as file:
    json.dump(data, file, indent=4)
print("Data has been cleaned successfully")

Data has been cleaned successfully


### Next Steps
Your manager is happy with the cleaned data and says: "Great! Now that our data is structured, let's start analyzing it. First, let's build a 'People You May Know' feature!"

In [50]:
def find_people_may_know(user_id, data):
    # Map each user ID to the set of that user's friend IDs
    user_friends = {}
    for user in data['users']:
        user_friends[user['id']] = set(user['friends'])

    if user_id not in user_friends:
        return []

    direct_friends = user_friends[user_id]

    # Work in progress: so far this only prints the user's direct friends
    for friend in direct_friends:
        print(friend)

data = load_data('clean_data.json')
user_id = 2

find_people_may_know(user_id, data)


1
4
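
The cell above stops at listing direct friends. One common way to finish a "People You May Know" feature is to suggest friends of friends, ranked by how many mutual friends they share with the user. The sketch below takes that approach as an assumption, not as the required solution; `people_you_may_know` and `sample` are hypothetical names, and the sample reuses the four-user dataset from the first task.

```python
def people_you_may_know(user_id, data):
    """Suggest non-friends, ranked by number of mutual friends."""
    user_friends = {user["id"]: set(user["friends"]) for user in data["users"]}
    if user_id not in user_friends:
        return []

    direct_friends = user_friends[user_id]
    mutual_counts = {}
    for friend in direct_friends:
        # Friends that are missing from the data contribute no candidates.
        for candidate in user_friends.get(friend, set()):
            if candidate != user_id and candidate not in direct_friends:
                mutual_counts[candidate] = mutual_counts.get(candidate, 0) + 1

    # Most mutual friends first; ties are broken by the smaller ID.
    return sorted(mutual_counts, key=lambda c: (-mutual_counts[c], c))


sample = {
    "users": [
        {"id": 1, "name": "Amit", "friends": [2, 3], "liked_pages": [101]},
        {"id": 2, "name": "Priya", "friends": [1, 4], "liked_pages": [102]},
        {"id": 3, "name": "Rahul", "friends": [1], "liked_pages": [101, 103]},
        {"id": 4, "name": "Sara", "friends": [2], "liked_pages": [104]},
    ]
}

print(people_you_may_know(1, sample))
# [4]
print(people_you_may_know(2, sample))
# [3]
```

Note that a suggested ID may belong to a user who was removed during cleaning, so filtering the result against the remaining user IDs may be worthwhile.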


### Next Steps
Wow! Let's build one last feature: 'Page Recommendation'.