__Knowledge Check 3 - Analysis of a Data File__

In [1]:
""" A tool to analyze and tidy up a data table

This tool allows you to analyze and tidy up a data table.

You can:
- Inspect and remove duplicate rows
- Inspect and remove rows with missing values
- Save the edited data to a new file
"""

import pandas as pd

# *** Load data ***
df = pd.read_csv('housing.csv')

# *** List available columns ***
print("Available columns:")
print()
print(df.columns)
print()
print()
# *** Select editing options ***
while True:
    print("Select editing options:")
    print("1. Remove duplicate rows")
    print("2. Remove rows with missing values")
    print("3. Save data")
    edit_option = input("Enter the number of the editing option: ")
    # *** Remove duplicate rows ***
    if edit_option == "1":
        # *** List duplicate rows ***
        duplicate_rows = df[df.duplicated()]
        print()
        print("Duplicate rows:")
        print()
        if duplicate_rows.empty:
            print()
            print("No duplicate rows found")
            print()
            print()
            continue
        else:
            print(duplicate_rows)
        do_option = input("Do you want to remove duplicate rows? (y/n): ")
        if do_option == "y":
            print()
            df = df.drop_duplicates()
            print("Duplicate rows removed")
            print()
            print()
        else:
            print()
            print("Duplicate rows not removed")
            print()
            print()

    # *** Remove rows with missing values ***
    elif edit_option == "2":
        # *** List rows with missing values ***
        missing_values = df[df.isnull().any(axis=1)]
        print()
        print("Rows with missing values:")
        print()
        if missing_values.empty:
            print()
            print("No rows with missing values found")
            print()
            # Nothing to remove, so go back to the menu instead of asking
            continue
        else:
            print()
            print(missing_values)
            print()
        do_option = input("Do you want to remove rows with missing values? (y/n): ")
        if do_option == "y":
            df = df.dropna()
            print()
            print("Rows with missing values removed")
            print()
            print()
        else:
            print()
            print("Rows with missing values not removed")
            print()
            print()

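    # *** Save data ***
    elif edit_option == "3":
        break

    # *** Handle any other input ***
    else:
        print()
        print("Invalid option, please enter 1, 2 or 3")
        print()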

df.to_csv('data_edited.csv', index=False)


print()
print("Data saved to data_edited.csv")
print()

Available columns:

Index(['longitude', 'latitude', 'housing_median_age', 'total_rooms',
       'total_bedrooms', 'population', 'households', 'median_income',
       'median_house_value', 'ocean_proximity'],
      dtype='object')


Select editing options:
1. Remove duplicate rows
2. Remove rows with missing values
3. Save data


Enter the number of the editing option:  1



Duplicate rows:


No duplicate rows found


Select editing options:
1. Remove duplicate rows
2. Remove rows with missing values
3. Save data


Enter the number of the editing option:  2



Rows with missing values:


       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
290      -122.16     37.77                47.0       1256.0             NaN   
341      -122.17     37.75                38.0        992.0             NaN   
538      -122.28     37.78                29.0       5154.0             NaN   
563      -122.24     37.75                45.0        891.0             NaN   
696      -122.10     37.69                41.0        746.0             NaN   
...          ...       ...                 ...          ...             ...   
20267    -119.19     34.20                18.0       3620.0             NaN   
20268    -119.18     34.19                19.0       2393.0             NaN   
20372    -118.88     34.17                15.0       4260.0             NaN   
20460    -118.75     34.29                17.0       5512.0             NaN   
20484    -118.72     34.28                17.0       3051.0             NaN   

       population  hou

Do you want to remove rows with missing values? (y/n):  3



Rows with missing values not removed


Select editing options:
1. Remove duplicate rows
2. Remove rows with missing values
3. Save data


Enter the number of the editing option:  1



Duplicate rows:


No duplicate rows found


Select editing options:
1. Remove duplicate rows
2. Remove rows with missing values
3. Save data


Enter the number of the editing option:  2



Rows with missing values:


       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
290      -122.16     37.77                47.0       1256.0             NaN   
341      -122.17     37.75                38.0        992.0             NaN   
538      -122.28     37.78                29.0       5154.0             NaN   
563      -122.24     37.75                45.0        891.0             NaN   
696      -122.10     37.69                41.0        746.0             NaN   
...          ...       ...                 ...          ...             ...   
20267    -119.19     34.20                18.0       3620.0             NaN   
20268    -119.18     34.19                19.0       2393.0             NaN   
20372    -118.88     34.17                15.0       4260.0             NaN   
20460    -118.75     34.29                17.0       5512.0             NaN   
20484    -118.72     34.28                17.0       3051.0             NaN   

       population  hou

Do you want to remove rows with missing values? (y/n):  y



Rows with missing values removed


Select editing options:
1. Remove duplicate rows
2. Remove rows with missing values
3. Save data


Enter the number of the editing option:  3



Data saved to data_edited.csv
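
The cleaning steps in the cell above can also be sketched as a plain function without `input()` prompts, which makes them easier to reuse and check. The function name `tidy_table`, its parameters, the report keys, and the toy data below are illustrative assumptions and are not part of the original notebook; only `duplicated`, `drop_duplicates`, `isnull` and `dropna` come from the cell above.

```python
import pandas as pd


def tidy_table(df, drop_duplicates=True, drop_missing=True):
    """Return a cleaned copy of df plus a small report (hypothetical helper)."""
    report = {
        # Rows that repeat an earlier row exactly
        "duplicate_rows": int(df.duplicated().sum()),
        # Rows with at least one missing value
        "rows_with_missing": int(df.isnull().any(axis=1).sum()),
        "rows_before": len(df),
    }
    cleaned = df
    if drop_duplicates:
        cleaned = cleaned.drop_duplicates()
    if drop_missing:
        cleaned = cleaned.dropna()
    report["rows_after"] = len(cleaned)
    return cleaned.reset_index(drop=True), report


# Toy data (made up for illustration): one duplicate row, one row with a NaN
sample = pd.DataFrame({
    "total_rooms": [1256.0, 1256.0, 992.0, 5154.0],
    "total_bedrooms": [200.0, 200.0, None, 1000.0],
})

cleaned, report = tidy_table(sample)
print(report)
print(cleaned)
```

Returning the report alongside the cleaned table keeps the "inspect" and "remove" steps together, so the interactive menu could print the counts before asking for confirmation.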



In [2]:
""" Interactive housing map

This tool allows you to filter the housing data by age, income, house value, population, and proximity to the ocean.
Interactive sliders and a drop-down selector allow you to adjust the filters and see the results immediately.
"""

import pandas as pd
import matplotlib.pyplot as plt
import ipywidgets as widgets
from IPython.display import display, clear_output

# *** Select original or edited data, then load it ***
file_sel = input("Select original or edited data (o/e): ")
if file_sel == "e":
    df = pd.read_csv('data_edited.csv')
else:
    # Use the original data for "o" and for any other input
    df = pd.read_csv('housing.csv')

# *** Set slider limits *** 
max_age = df['housing_median_age'].max()
max_income = df['median_income'].max()
max_house_value = df['median_house_value'].max()
min_house_value = df['median_house_value'].min()
max_population = df['population'].max()
min_population = df['population'].min()

# *** Widget setup of limits and initial values ***
age_slider = widgets.IntSlider(value=max_age, min=1, max=max_age, step=1, description='House median age')
income_slider = widgets.FloatSlider(value=0.0, min=0.0, max=max_income, step=0.1, description='Income')
value_slider = widgets.IntSlider(value=min_house_value, min=0, max=max_house_value, step=1000, description='House value')
population_slider = widgets.IntSlider(value=min_population, min=min_population, max=max_population, step=10, description='Population')
ocean_options = ['<any>'] + sorted(df['ocean_proximity'].unique())
proximity_dropdown = widgets.Dropdown(options=ocean_options, description='Proximity')

output = widgets.Output()

# *** Filter data from slider values  *** 
def filter_data(age, income, house_value, population, proximity):
    filtered = df[
        (df['housing_median_age'] <= age) &
        (df['median_income'] >= income) &
        (df['median_house_value'] >= house_value) &
        (df['population'] >= population)
    ]
    # *** Filter data from location selector value ***
    if proximity != '<any>':
        filtered = filtered[filtered['ocean_proximity'] == proximity]
    return filtered

#  *** Plot data from slider values *** 
def update_plot(*args):
    with output:
        clear_output(wait=True)
        filtered = filter_data(
            age_slider.value,
            income_slider.value,
            value_slider.value,
            population_slider.value,
            proximity_dropdown.value
        )
        if filtered.empty:
            print("No data matching filter criteria.")
            return
        sizes = filtered['population'] / 100
        plt.figure(figsize=(10, 6))
        plt.scatter(
            filtered['longitude'],
            filtered['latitude'],
            c=filtered['population'],
            s=sizes,
            cmap=plt.cm.Set1,
            alpha=0.9
        )
        plt.title("Housing Map: Population by Location")
        plt.xlabel('Longitude')
        plt.ylabel('Latitude')
        plt.colorbar(label='Population')
        plt.show()

# *** Create widgets observer - "event" hooks ***
age_slider.observe(update_plot, names='value')
income_slider.observe(update_plot, names='value')
value_slider.observe(update_plot, names='value')
population_slider.observe(update_plot, names='value')
proximity_dropdown.observe(update_plot, names='value')

# *** Display UI - Plot and filters ***
ui = widgets.VBox([age_slider, income_slider, value_slider, population_slider, proximity_dropdown])
display(ui, output)

# *** Initial plot ***
update_plot() 

Select original or edited data (o/e):  e


VBox(children=(IntSlider(value=52, description='House median mge', max=52, min=1), FloatSlider(value=0.0, desc…

Output()
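
Because the widgets only render in a live notebook, the filtering logic behind the sliders can be checked separately on a small table. The sketch below is a variant of `filter_data` that takes the DataFrame as an argument instead of reading the global `df`. The name `filter_housing`, the parameter names, and the toy rows are assumptions made for illustration; the column names and comparison directions follow the cell above.

```python
import pandas as pd


def filter_housing(df, max_age, min_income, min_house_value,
                   min_population, proximity="<any>"):
    """Apply the same comparisons as the sliders to any DataFrame (hypothetical helper)."""
    mask = (
        (df["housing_median_age"] <= max_age)
        & (df["median_income"] >= min_income)
        & (df["median_house_value"] >= min_house_value)
        & (df["population"] >= min_population)
    )
    # "<any>" means that no proximity filter is applied
    if proximity != "<any>":
        mask &= df["ocean_proximity"] == proximity
    return df[mask]


# Toy rows (made up for illustration, not taken from housing.csv)
toy = pd.DataFrame({
    "housing_median_age": [10.0, 30.0, 52.0],
    "median_income": [2.5, 4.0, 8.0],
    "median_house_value": [100000.0, 250000.0, 500000.0],
    "population": [300.0, 1200.0, 800.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "NEAR BAY"],
})

everything = filter_housing(toy, 52, 0.0, 0, 0)
younger_richer = filter_housing(toy, 40, 3.0, 0, 0)
near_bay = filter_housing(toy, 52, 0.0, 0, 0, proximity="NEAR BAY")
no_match = filter_housing(toy, 52, 0.0, 0, 1000, proximity="INLAND")

print(len(everything), len(younger_richer), len(near_bay), len(no_match))
```

Passing the DataFrame in explicitly means the same function works on both the original and the edited file without depending on which one was loaded last.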

__Self-evaluation:__
1. What has been challenging?
   - Finding a creative angle. I went around in circles for a while on the room and bedroom columns but found them fairly uninteresting.
     I realized that I did not want to produce countless charts; I wanted to build an interactive widget with filter sliders instead, and then the picture became clear.
     The code more or less wrote itself once the goal was clear.
   - Finding a way to present interactive content in Jupyter was quite challenging. Rumor has it that it can work in JupyterLab,
     but I did not get anywhere there either. The "plotly" library might work, but it felt out of scope for this assignment. Then I found "ipywidgets", which actually solved the problem.
2. I personally think these two scripts deserve full marks: they answer the question, with a certain finesse. But I am aware that I am a
   "good enough" programmer, so there are of course points in them that can be criticized.
3. The breadth of the course so far is good. Some parts are repetition for me, but the math and matrices as equations are completely new to me.