The first part of this project evaluates the quality of a sample dataset provided by the client and prepares it for subsequent analysis. The second part is a comprehensive analysis to meet the client's needs.

# Data Description:
The dataset provided by the client is a Python list in which each element is itself a list describing one user. Each user record contains the following fields, in this order:

'user_id': Unique identifier for each user.
'user_name': The name of the user.
'user_age': The age of the user.
'fav_categories': Favorite categories of items purchased by the user, such as 'ELECTRONICS', 'SPORT', and 'BOOKS'.
'total_spendings': A list of integers indicating the total amount spent in each of the favorite categories, in the same order as 'fav_categories'.

In [47]:
users = [
    ['32415', ' mike_reed ', 32.0, ['ELECTRONICS', 'SPORT', 'BOOKS'], [894, 213, 173]],
    ['31980', 'kate morgan', 24.0, ['CLOTHES', 'BOOKS'], [439, 390]],
    ['32156', ' john doe ', 37.0, ['ELECTRONICS', 'HOME', 'FOOD'], [459, 120, 99]],
    ['32761', 'SAMANTHA SMITH', 29.0, ['CLOTHES', 'ELECTRONICS', 'BEAUTY'], [299, 679, 85]],
    ['32984', 'David White', 41.0, ['BOOKS', 'HOME', 'SPORT'], [234, 329, 243]],
    ['33001', 'emily brown', 26.0, ['BEAUTY', 'HOME', 'FOOD'], [213, 659, 79]],
    ['33767', ' Maria Garcia', 33.0, ['CLOTHES', 'FOOD', 'BEAUTY'], [499, 189, 63]],
    ['33912', 'JOSE MARTINEZ', 22.0, ['SPORT', 'ELECTRONICS', 'HOME'], [259, 549, 109]],
    ['34009', 'lisa wilson ', 35.0, ['HOME', 'BOOKS', 'CLOTHES'], [329, 189, 329]],
    ['34278', 'James Lee', 28.0, ['BEAUTY', 'CLOTHES', 'ELECTRONICS'], [189, 299, 579]],
]
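As a minimal sketch of how one record maps to the fields described above, the snippet below unpacks a single row into named variables and pairs each category with its spending. The name `spend_by_category` is introduced here for illustration only and is not part of the client's data.

```python
# One record in the layout described above (values copied from the sample data).
record = ['32415', ' mike_reed ', 32.0, ['ELECTRONICS', 'SPORT', 'BOOKS'], [894, 213, 173]]

# Unpack the five positional fields into named variables.
user_id, user_name, user_age, fav_categories, total_spendings = record

# Each spending value lines up with the category at the same position.
# spend_by_category is a hypothetical helper name used only in this sketch.
spend_by_category = dict(zip(fav_categories, total_spendings))

print(user_id)
print(spend_by_category)
```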


# Stage 1: Data Processing

Store 1 aims to ensure consistency in data collection. As part of this initiative, the quality of the data collected about users must be evaluated. The following section reviews the collected data and proposes changes.

In [48]:
user_id = '32415'
user_name = ' mike_reed '
user_age = 32.0
fav_categories = ['ELECTRONICS', 'SPORT', 'BOOKS']

**Observations:**

1. The 'user_id' is correctly formatted as a string. This allows for leading zeros and alphanumeric characters.
2. The 'user_name' variable contains a string with unnecessary spaces and an underscore between the first and last names.
3. The data type of 'user_age' is float. Since age is measured in whole years, it should be converted to an integer. Converting it to a string is not recommended because it complicates analysis and updates.
4. The 'fav_categories' list contains strings in uppercase. This uniformity is appropriate.
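The observations above could also be confirmed programmatically. The following is a rough sketch, assuming the same four variables. The `checks` dictionary and its labels are illustrative and not part of the original review.

```python
user_id = '32415'
user_name = ' mike_reed '
user_age = 32.0
fav_categories = ['ELECTRONICS', 'SPORT', 'BOOKS']

# Each entry pairs a human-readable label with the result of one check.
checks = {
    'user_id is a string': isinstance(user_id, str),
    'user_name has surrounding spaces': user_name != user_name.strip(),
    'user_name contains an underscore': '_' in user_name,
    'user_age is a float': isinstance(user_age, float),
    'all categories are uppercase': all(c.isupper() for c in fav_categories),
}

for label, result in checks.items():
    print(f"{label}: {result}")
```

For this record, every check evaluates to True, which matches the four observations.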

**user_name**

The identified changes are implemented below. First, the issues with the 'user_name' variable are corrected. As observed, it contains unnecessary spaces and an underscore as a separator between the first and last names.

In [49]:
user_name = ' mike_reed '
user_name = user_name.strip()
user_name = user_name.replace("_"," ")

print(user_name)

mike reed


Next, the updated 'user_name' is split into two substrings, producing a list with two values: the first name and the last name.

In [50]:
user_name = 'mike reed'
name_split = user_name.split(" ")

print(name_split)

['mike', 'reed']
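One caveat worth noting: `split(" ")` splits on every single space, so a name with two consecutive separators yields an empty string in the result. The sketch below uses a hypothetical input (not present in the client's data) to contrast it with `split()` called without arguments, which splits on any run of whitespace.

```python
raw_name = '  mike__reed '  # hypothetical value with a doubled underscore

# split(" ") keeps an empty string between two consecutive spaces.
with_explicit_sep = raw_name.strip().replace("_", " ").split(" ")

# split() with no argument splits on runs of whitespace and ignores the ends,
# so the strip() call is not needed here.
with_default_sep = raw_name.replace("_", " ").split()

print(with_explicit_sep)  # ['mike', '', 'reed']
print(with_default_sep)   # ['mike', 'reed']
```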


**user_age**

The next variable is 'user_age'. As mentioned, it has an unsuitable data type. To resolve this, the value is converted to an integer and the result is displayed.


In [51]:
user_age = 32.0
user_age = int(user_age)

print(user_age)

32
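Note that `int()` truncates the fractional part of a float rather than rounding it. All ages in the sample are whole numbers, so nothing is lost here. If fractional ages could appear, a guard such as the one below might be useful. The function name `age_to_int` is an assumption made for this sketch.

```python
def age_to_int(age):
    """Convert an age to int, refusing floats that have a fractional part."""
    if isinstance(age, float) and not age.is_integer():
        raise ValueError(f"age {age} has a fractional part")
    return int(age)


print(int(32.9))          # 32, the fractional part is silently dropped
print(age_to_int(32.0))   # 32
```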


The 'user_age' value may not always be convertible to an integer. To prevent the system from crashing in that case, the code below attempts the conversion and assigns the result to 'user_age_int'. If the attempt fails, it displays the message: Please provide your age as a numerical value.

In [52]:
user_age = 'treinta y dos'

try:
    # Store the converted value under a separate name so the original input is preserved.
    user_age_int = int(user_age)
except ValueError:
    print("Please provide your age as a numerical value.")

Please provide your age as a numerical value.
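The same idea can be wrapped in a reusable function so that it can be applied to every record. This is only a sketch: the name `parse_age` and the choice to return `None` on failure are assumptions, not requirements from the client.

```python
def parse_age(value):
    """Return value as an int, or None when it cannot be converted."""
    try:
        return int(value)
    except (ValueError, TypeError):
        # ValueError: non-numeric text such as 'treinta y dos'.
        # TypeError: None or another type that int() does not accept.
        print("Please provide your age as a numerical value.")
        return None


print(parse_age(32.0))             # 32
print(parse_age('treinta y dos'))  # prints the message, then None
```

Returning `None` lets the caller decide whether to skip the record or ask for the value again.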


# Stage 2: Data Analysis

The management team at Store 1 has asked for your help in organizing their customer data to better analyze and manage it.

Your task is to sort this list by user ID in ascending order to make it easier to access and analyze.


In [53]:
users = [
    ['32415', ' mike_reed ', 32.0, ['ELECTRONICS', 'SPORT', 'BOOKS'], [894, 213, 173]],
    ['31980', 'kate morgan', 24.0, ['CLOTHES', 'BOOKS'], [439, 390]],
    ['32156', ' john doe ', 37.0, ['ELECTRONICS', 'HOME', 'FOOD'], [459, 120, 99]],
    ['32761', 'SAMANTHA SMITH', 29.0, ['CLOTHES', 'ELECTRONICS', 'BEAUTY'], [299, 679, 85]],
    ['32984', 'David White', 41.0, ['BOOKS', 'HOME', 'SPORT'], [234, 329, 243]],
    ['33001', 'emily brown', 26.0, ['BEAUTY', 'HOME', 'FOOD'], [213, 659, 79]],
    ['33767', ' Maria Garcia', 33.0, ['CLOTHES', 'FOOD', 'BEAUTY'], [499, 189, 63]],
    ['33912', 'JOSE MARTINEZ', 22.0, ['SPORT', 'ELECTRONICS', 'HOME'], [259, 549, 109]],
    ['34009', 'lisa wilson ', 35.0, ['HOME', 'BOOKS', 'CLOTHES'], [329, 189, 329]],
    ['34278', 'James Lee', 28.0, ['BEAUTY', 'CLOTHES', 'ELECTRONICS'], [189, 299, 579]],
]

users.sort(key=lambda x: x[0])

print(users)

[['31980', 'kate morgan', 24.0, ['CLOTHES', 'BOOKS'], [439, 390]], ['32156', ' john doe ', 37.0, ['ELECTRONICS', 'HOME', 'FOOD'], [459, 120, 99]], ['32415', ' mike_reed ', 32.0, ['ELECTRONICS', 'SPORT', 'BOOKS'], [894, 213, 173]], ['32761', 'SAMANTHA SMITH', 29.0, ['CLOTHES', 'ELECTRONICS', 'BEAUTY'], [299, 679, 85]], ['32984', 'David White', 41.0, ['BOOKS', 'HOME', 'SPORT'], [234, 329, 243]], ['33001', 'emily brown', 26.0, ['BEAUTY', 'HOME', 'FOOD'], [213, 659, 79]], ['33767', ' Maria Garcia', 33.0, ['CLOTHES', 'FOOD', 'BEAUTY'], [499, 189, 63]], ['33912', 'JOSE MARTINEZ', 22.0, ['SPORT', 'ELECTRONICS', 'HOME'], [259, 549, 109]], ['34009', 'lisa wilson ', 35.0, ['HOME', 'BOOKS', 'CLOTHES'], [329, 189, 329]], ['34278', 'James Lee', 28.0, ['BEAUTY', 'CLOTHES', 'ELECTRONICS'], [189, 299, 579]]]
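Because 'user_id' is stored as a string, the sort above compares IDs character by character. That gives the expected order here only because every ID has the same number of digits. The sketch below uses hypothetical IDs of different lengths to show the difference, and it assumes the IDs are purely numeric. An alphanumeric ID would make `int` raise a `ValueError`.

```python
ids = ['32415', '9999', '100000']  # hypothetical IDs with different lengths

# Text comparison: '1' < '3' < '9', so the longest ID comes first.
as_text = sorted(ids)

# Numeric comparison: convert each ID to int only for the purpose of ordering.
as_numbers = sorted(ids, key=int)

print(as_text)     # ['100000', '32415', '9999']
print(as_numbers)  # ['9999', '32415', '100000']
```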


We have information on our users' consumption habits, including the amount spent in each of their favorite categories. Management is interested in knowing the total amount spent by each user.

In [54]:
fav_categories_low = ['electronics', 'sport', 'books']
spendings_per_category = [894, 213, 173]

total_amount = sum(spendings_per_category)

print(total_amount)


1280
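The calculation above covers a single user. A possible way to extend it to every record, sketched here on a shortened copy of the list, is a dictionary comprehension keyed by 'user_id'. The name `total_by_user` is introduced for this example only.

```python
users = [
    ['32415', ' mike_reed ', 32.0, ['ELECTRONICS', 'SPORT', 'BOOKS'], [894, 213, 173]],
    ['31980', 'kate morgan', 24.0, ['CLOTHES', 'BOOKS'], [439, 390]],
    ['32156', ' john doe ', 37.0, ['ELECTRONICS', 'HOME', 'FOOD'], [459, 120, 99]],
]

# user[0] is the user_id and user[4] is the list of spendings per category.
total_by_user = {user[0]: sum(user[4]) for user in users}

print(total_by_user)  # {'32415': 1280, '31980': 829, '32156': 678}
```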


The company's management has asked us to come up with a way to summarize all the information about a user. The goal is to create a formatted string that uses information from the 'user_id', 'user_name', and 'user_age' variables.

This is the final string we want to create: 'User 32415 is mike who is 32 years old.'

In [55]:
user_id = '32415'
user_name = ['mike', 'reed']
user_age = 32

user_info = f"User {user_id} is {user_name[0]} who is {user_age} years old."
print(user_info)

User 32415 is mike who is 32 years old.
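The same f-string can be reused for several users at once. The sketch below assumes records that have already been cleaned, with the name stored as a [first name, last name] list and the age as an integer. The shortened `users_clean` list and the `summaries` name are illustrative.

```python
# Hypothetical cleaned records: [user_id, [first name, last name], age]
users_clean = [
    ['32415', ['mike', 'reed'], 32],
    ['31980', ['kate', 'morgan'], 24],
]

# Build one summary string per record, unpacking the three fields in the loop.
summaries = [
    f"User {uid} is {name[0]} who is {age} years old."
    for uid, name, age in users_clean
]

for line in summaries:
    print(line)
```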


Management also wants an easy way to know how many customers we have data for. The goal is to create a formatted string that shows the number of registered customers.

This is the final string we want to create: 'We have registered data for X customers'.

In [56]:
users = [
    ['32415', ' mike_reed ', 32.0, ['ELECTRONICS', 'SPORT', 'BOOKS'], [894, 213, 173]],
    ['31980', 'kate morgan', 24.0, ['CLOTHES', 'BOOKS'], [439, 390]],
    ['32156', ' john doe ', 37.0, ['ELECTRONICS', 'HOME', 'FOOD'], [459, 120, 99]],
    ['32761', 'SAMANTHA SMITH', 29.0, ['CLOTHES', 'ELECTRONICS', 'BEAUTY'], [299, 679, 85]],
    ['32984', 'David White', 41.0, ['BOOKS', 'HOME', 'SPORT'], [234, 329, 243]],
    ['33001', 'emily brown', 26.0, ['BEAUTY', 'HOME', 'FOOD'], [213, 659, 79]],
    ['33767', ' Maria Garcia', 33.0, ['CLOTHES', 'FOOD', 'BEAUTY'], [499, 189, 63]],
    ['33912', 'JOSE MARTINEZ', 22.0, ['SPORT', 'ELECTRONICS', 'HOME'], [259, 549, 109]],
    ['34009', 'lisa wilson ', 35.0, ['HOME', 'BOOKS', 'CLOTHES'], [329, 189, 329]],
    ['34278', 'James Lee', 28.0, ['BEAUTY', 'CLOTHES', 'ELECTRONICS'], [189, 299, 579]],
]


user_info = f"We have registered data for {len(users)} customers"
print(user_info)

We have registered data for 10 customers
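`len(users)` counts records, not distinct customers. If the same 'user_id' could appear more than once, counting unique IDs would give a more reliable figure. The sample data contains no duplicates, so the list below is a hypothetical example made up for this sketch.

```python
# Hypothetical list of IDs in which the last entry repeats the first one.
user_ids = ['32415', '31980', '32156', '32415']

# A set keeps only one copy of each ID.
unique_count = len(set(user_ids))

message = f"We have registered data for {unique_count} customers"
print(message)  # We have registered data for 3 customers
```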


All changes should be applied to the customer list. To simplify the example, a shorter list is shown.

The steps taken are:

1. Remove all leading and trailing spaces from the names, as well as any underscores.
2. Convert all ages to integers.
3. Separate all first and last names into a sublist.

The modified list is then saved as a new list called 'users_clean' and displayed on the screen.

In [57]:
users = [
    ['32415', ' mike_reed ', 32.0, ['ELECTRONICS', 'SPORT', 'BOOKS'], [894, 213, 173]],
    ['31980', 'kate morgan', 24.0, ['CLOTHES', 'BOOKS'], [439, 390]],
    ['32156', ' john doe ', 37.0, ['ELECTRONICS', 'HOME', 'FOOD'], [459, 120, 99]],
]

users_clean = []

user_name_1 = users[0][1].strip().replace("_", " ").split(" ")
user_age_1 = int(users[0][2])
users_clean.extend([user_name_1, user_age_1])

user_name_2 = users[1][1].strip().replace("_", " ").split(" ")
user_age_2 = int(users[1][2])
users_clean.extend([user_name_2, user_age_2])

user_name_3 = users[2][1].strip().replace("_", " ").split(" ")
user_age_3 = int(users[2][2])
users_clean.extend([user_name_3, user_age_3])

print(users_clean)

[['mike', 'reed'], 32, ['kate', 'morgan'], 24, ['john', 'doe'], 37]
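Since `extend` adds the name and the age as separate elements, the result above is a flat list in which the records run together. As an alternative sketch, a `for` loop with `append` avoids repeating the same three lines per user and keeps one sublist per user. This is one possible variation, not the approach used above.

```python
users = [
    ['32415', ' mike_reed ', 32.0, ['ELECTRONICS', 'SPORT', 'BOOKS'], [894, 213, 173]],
    ['31980', 'kate morgan', 24.0, ['CLOTHES', 'BOOKS'], [439, 390]],
    ['32156', ' john doe ', 37.0, ['ELECTRONICS', 'HOME', 'FOOD'], [459, 120, 99]],
]

users_clean = []

for user in users:
    # Same cleaning steps as above, applied to whichever record the loop is on.
    name_parts = user[1].strip().replace("_", " ").split(" ")
    age = int(user[2])
    # append keeps the name and age of each user together in one sublist.
    users_clean.append([name_parts, age])

print(users_clean)
# [[['mike', 'reed'], 32], [['kate', 'morgan'], 24], [['john', 'doe'], 37]]
```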


To apply these changes to every record while keeping the remaining fields, the list can be rebuilt with a list comprehension:

In [58]:
users = [
    ['32415', ' mike_reed ', 32.0, ['ELECTRONICS', 'SPORT', 'BOOKS'], [894, 213, 173]],
    ['31980', 'kate morgan', 24.0, ['CLOTHES', 'BOOKS'], [439, 390]],
    ['32156', ' john doe ', 37.0, ['ELECTRONICS', 'HOME', 'FOOD'], [459, 120, 99]],
]



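# Rebuild each record with a cleaned name and an integer age.
users = [
    [
        user[0],                                       # user_id (unchanged)
        user[1].strip().replace("_", " ").split(" "),  # [first name, last name]
        int(user[2]),                                  # age as an integer
        user[3],                                       # fav_categories (unchanged)
        user[4],                                       # total_spendings (unchanged)
    ]
    for user in users
]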


print(users)


[['32415', ['mike', 'reed'], 32, ['ELECTRONICS', 'SPORT', 'BOOKS'], [894, 213, 173]], ['31980', ['kate', 'morgan'], 24, ['CLOTHES', 'BOOKS'], [439, 390]], ['32156', ['john', 'doe'], 37, ['ELECTRONICS', 'HOME', 'FOOD'], [459, 120, 99]]]
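The list comprehension above creates a new list and rebinds the name 'users' to it. Any other variable that still points at the original list would keep the uncleaned data. If the existing list object itself has to be updated, slice assignment is one option. The sketch below assumes a second reference called `same_list`, a name made up for this example.

```python
users = [
    ['32415', ' mike_reed ', 32.0, ['ELECTRONICS', 'SPORT', 'BOOKS'], [894, 213, 173]],
    ['31980', 'kate morgan', 24.0, ['CLOTHES', 'BOOKS'], [439, 390]],
]

same_list = users  # hypothetical second reference to the same list object

# users[:] = ... replaces the contents of the existing list instead of
# binding the name to a new list, so same_list sees the cleaned records too.
users[:] = [
    [u[0], u[1].strip().replace("_", " ").split(" "), int(u[2]), u[3], u[4]]
    for u in users
]

print(same_list is users)  # True
print(same_list[0][1])     # ['mike', 'reed']
```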


----------
