# Assignment 3

### A selection of interesting solutions

#### Task 1: Exhaustive error handling and np.delete to clip array

**Note from Teo**: *In contrast to the solution I provided, it is, indeed, more professional to extend functions with **if+raise** and then write the **try+except** in the actual application (that would be in this case the next cell in the assignment notebook, where the array is being clipped).*

In [None]:
# Imports shared by the cells below
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def strip_mat(arr):
    """Simple function to strip down a 2-D matrix with at least 3 rows and 3 columns to its inner matrix, by deleting the outer rows and columns"""

    # At least check that the argument is not text or a plain number
    if isinstance(arr, (str, float, int)):
        raise TypeError("Given argument is not an array!")

    # Simple type cast (not perfect), then check for the correct dimension
    arr = np.asarray(arr)
    if arr.ndim != 2:
        raise ValueError("Given matrix is not a 2-D matrix!")

    # Check the number of matrix rows and columns;
    # "(nrow or ncol) < 3" would only ever test nrow
    nrow, ncol = arr.shape
    if nrow < 3 or ncol < 3:
        raise ValueError("Given matrix has too few columns or rows!")
    
    arr = np.delete(arr,0,0)    # delete first row
    arr = np.delete(arr,-1,0)   # delete last row
    arr = np.delete(arr,0,1)    # delete first column 
    
    return np.delete(arr,-1,1)  # delete last column
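
The pattern from the note above (the function only raises, the calling code decides how to react) could look roughly like the following sketch. The helper name `strip_inner` and the use of slicing instead of `np.delete` are assumptions made for this illustration, not part of the submitted solution.

```python
import numpy as np

def strip_inner(arr):
    """Return the inner part of a 2-D array (hypothetical helper for illustration)."""
    arr = np.asarray(arr)
    if arr.ndim != 2:
        raise ValueError("Expected a 2-D array")
    if arr.shape[0] < 3 or arr.shape[1] < 3:
        raise ValueError("Expected at least 3 rows and 3 columns")
    # Slicing drops the first and last row and column in one step
    return arr[1:-1, 1:-1]

# Valid input: a 4x4 matrix is reduced to its 2x2 centre
inner = strip_inner(np.arange(16).reshape(4, 4))

# Invalid input: the application code catches the error raised by the function
try:
    strip_inner(np.ones((2, 5)))
    message = "no error"
except ValueError as err:
    message = str(err)

print(inner)
print(message)
```

Note that slicing returns a view on the original data, whereas `np.delete` returns a new array; call `.copy()` on the result if the original must not be affected by later changes.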
    

#### Task 2: Very compact

In [None]:
rand_arr=np.random.normal([1, -2], [2, 0.5], (10000, 2))
plt.hist(rand_arr[:,0], bins = 20)
plt.hist(rand_arr[:,1], bins = 20)
plt.show()

#### Task 2: Dynamically extending the random array

In [None]:
rand_arr = np.random.normal(1,2, size = (10000,2))
rand_arr[:,1] = np.random.normal(-2,0.5, size = 10000)

#The 2 sources from where I've learned a lot about Normal (Gaussian) Distribution
#Source 1: https://www.w3schools.com/python/numpy/numpy_random_normal.asp
#Source 2: https://sparkbyexamples.com/numpy/how-to-use-numpy-random-normal-in-python/

#### Task 3: Inner function for transforming one single point

In [None]:
def cartesian_to_polar(arr: np.ndarray) -> np.ndarray:
    """
    Converts the given array of cartesian coordinates into polar coordinates.
    The array must be of shape (n, 2). It is modified in place, so it needs a float dtype.

    :param arr: Array of cartesian coordinates
    :returns  : Array of polar coordinates
    """
    if arr.ndim != 2 or arr.shape[1] != 2:
        raise ValueError("Array must be of shape (n, 2)")

    def _to_polar(x, y):
        r = np.sqrt(x**2 + y**2)
        # arctan2 picks the correct quadrant and handles x == 0, unlike arctan(y/x)
        phi = np.arctan2(y, x)

        return r, phi

    for i in range(arr.shape[0]):
        x = arr[i, 0]
        y = arr[i, 1]

        arr[i, 0], arr[i, 1] = _to_polar(x, y)

    return arr
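
The per-point loop can also be expressed with whole-array operations. The following is a minimal sketch of that idea, not the submitted solution; the function name `cartesian_to_polar_vectorised` is made up for this illustration. It uses `np.hypot` for the radius and `np.arctan2` for the angle, which takes the signs of both coordinates into account, and it returns a new array instead of overwriting the input.

```python
import numpy as np

def cartesian_to_polar_vectorised(points):
    """Hypothetical vectorised variant: (n, 2) cartesian -> (n, 2) polar."""
    points = np.asarray(points, dtype=float)
    if points.ndim != 2 or points.shape[1] != 2:
        raise ValueError("Array must be of shape (n, 2)")
    x, y = points[:, 0], points[:, 1]
    r = np.hypot(x, y)          # sqrt(x**2 + y**2) for every row at once
    phi = np.arctan2(y, x)      # angle in radians, in the range [-pi, pi]
    return np.column_stack((r, phi))

# Three points on the axes: positive x, positive y, negative x
polar = cartesian_to_polar_vectorised([[1.0, 0.0], [0.0, 2.0], [-1.0, 0.0]])
print(polar)
```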

#### Task 3: Complex Numbers

In [None]:
# source: https://stackoverflow.com/questions/20924085/python-conversion-between-coordinates
import cmath

array = np.random.rand(10,2)

for i in range(len(array)):
    input_num = complex(array[i][0], array[i][1]) 
    r, phi = cmath.polar(input_num)
    array[i][0] = r
    array[i][1] = phi

#### Task 3: Checking if the transformation was right

In [None]:
rng = np.random.default_rng()  
matrix=(100-1)* rng.random(size=(10, 2))+1

matrix[:,0]  # x coordinates
matrix[:,1]  # y coordinates
r=np.sqrt(matrix[:,0]**2 + matrix[:,1]**2)
phi=np.arctan(matrix[:,1]/matrix[:,0])
# this is the converted matrix:
polarmatrix= np.column_stack((r, phi))

# validate the correctness of the matrix transformation 
# by checking the 1st Point with the cmath polar function:
print("r_1 with own transformation: ", r[0])
print("phi_1 with own transformation: ", phi[0])

import cmath
matrix_xy_point1 = complex(matrix[0,0], matrix[0,1]) # stored as 1+2j
r, phi = cmath.polar(matrix_xy_point1)
print("r_1 with cmath: ",r)
print("phi_1 with cmath: ",phi)
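
The same check can be extended from the first point to all points. This is a sketch under assumptions: the variable names and the fixed seed are chosen only for this illustration, and `np.allclose` is used so that tiny floating-point differences do not count as a mismatch.

```python
import cmath
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen only to make the example reproducible
points = (100 - 1) * rng.random(size=(10, 2)) + 1  # coordinates in [1, 100)

# Own transformation, applied to the whole array
r = np.hypot(points[:, 0], points[:, 1])
phi = np.arctan2(points[:, 1], points[:, 0])
own = np.column_stack((r, phi))

# Reference values from cmath.polar, one point at a time
reference = np.array([cmath.polar(complex(x, y)) for x, y in points])

all_match = np.allclose(own, reference)
print(all_match)
```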

#### Task 3: Optional parameter for angle in degrees or radians

In [None]:
def transform_coordinates(x, degree = False):
    """
    This function transforms given cartesian coordinates to polar coordinates. The angle is given in radians. If you want to have it in degrees just add degree = True
    Input:  Array of dimension [n,2] with [:,0] being the x-coordinate and [:,1] being the y-coordinate.
            degree (bool): -> if result should be in degrees or radians. 
    Output: Array of dimension [n,2] with [:,0] being the radius and [:,1] being the angle.
    """
    r = np.sqrt(x[:,0]**2 + x[:,1]**2)
    phi = np.arctan2(x[:,1], x[:,0])
    if degree:
        phi = np.degrees(phi)  # same as phi / (np.pi / 180)
    return np.column_stack((r,phi))

#### Task 4: Extract column, then describe

In [None]:
file = "../data/zuwendungen-berlin.csv.gz"
df = pd.read_csv(file)

print(df['Betrag'].describe(percentiles=[0.5]))

#### Task 4: Dropping unnecessary stats

In [None]:
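data = pd.read_csv("../data/zuwendungen-berlin.csv.gz")  # forward slashes work on every OS and avoid invalid escape sequences
stats = data.describe()
stats = stats.drop(index=["25%", "75%"])  # recent pandas versions accept the axis only as a keyword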
result= stats["Betrag"].values.tolist()
print(result)

#### Task 4: Statistics through aggregation

In [None]:
out = (
        df['Betrag'].agg(['count','mean', 'std', 'min','median','max'])
       )
print(out.tolist())

#### Task 5: Lambda-Filter

In [None]:
grouped_df = df.groupby("Name")
dff = grouped_df.apply(lambda d: d["Betrag"].sum()==250).reset_index()
dff[dff[0] == True]["Name"]

#### Task 5: dataframe.filter()

In [None]:
dataframe = df.groupby('Name')
dataframe2 = dataframe.filter(lambda x: x.Betrag.sum() == 250)
print(dataframe2['Name'])
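
Both variants above evaluate a lambda once per group. A possible alternative, sketched here on invented toy data, is to sum once per name and then filter the resulting Series with a boolean mask. The column names follow the assignment; the values and the variable names `toy`, `totals` and `names` are made up for this illustration.

```python
import pandas as pd

# Toy data, invented for this illustration
toy = pd.DataFrame({
    "Name": ["A", "A", "B", "C", "C"],
    "Betrag": [100, 150, 250, 200, 10],
})

# One sum per name, then keep the names whose total is exactly 250
totals = toy.groupby("Name")["Betrag"].sum()
names = totals[totals == 250].index.tolist()
print(names)
```

With real amounts stored as floats, an exact comparison with `==` can miss values that differ only by rounding; `np.isclose` is one way to allow a tolerance.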

#### Task 7: Regular Expressions to deal with white spaces in U-Bahn labels

In [None]:
import os
import re

df = pd.read_csv(os.path.join("data", "zuwendungen-berlin.csv.gz"))
df['Ubahn'] = ""

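df_Verkehr = df[(df.Politikbereich == "Verkehr")].copy()  # copy, so the assignments below do not write into a slice of df

regex_pattern = r"U\s?\d" # U1, U 1, U2, U 2, U3, U 3, etc. (the raw string keeps the backslashes literal)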

for i in df_Verkehr.index:
    tempZweck = df_Verkehr.loc[i,'Zweck']
    if type(tempZweck) is str:
        matches = re.findall(regex_pattern, tempZweck)
        df_Verkehr.loc[i,'Ubahn'] = matches[0].strip().replace(" ", "") if len(matches) > 0 else ""

df_Verkehr = df_Verkehr[(df_Verkehr.Ubahn != "")]
ubahn_grouped = df_Verkehr.groupby(['Ubahn'])['Betrag'].agg(['sum']).rename(columns={'sum': 'Total_Spendings'}).reset_index()

print(ubahn_grouped.sort_values(by="Total_Spendings",ascending=False))
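
To see what the pattern `U\s?\d` does and does not match, it can be tried on a few strings in isolation. The sample strings below are invented for this illustration and do not come from the data set.

```python
import re

pattern = re.compile(r"U\s?\d")  # "U", at most one whitespace character, one digit

# Sample strings invented for this illustration
samples = ["Sanierung U 5", "Ausbau U2 Nord", "Busspur", "U  7 Tunnel"]

labels = []
for text in samples:
    match = pattern.search(text)
    # Normalise "U 5" to "U5"; use an empty label when nothing matches
    labels.append(re.sub(r"\s", "", match.group()) if match else "")

print(labels)  # ['U5', 'U2', '', '']
```

The last sample contains two spaces and is therefore not matched by `\s?`. A pattern with `\s*` would accept any number of whitespace characters.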


#### Task 7: New column through df.assign and replace white spaces with empty strings

In [None]:
df_Verkehr = df[(df.Politikbereich == "Verkehr")]

out = (
        df_Verkehr  # use the filtered frame, so that only rows from "Verkehr" are counted
          .assign(UBahn=df_Verkehr["Zweck"].str.extract(r"(U\s*\d)", expand=False).str.replace(" ", ""))
          .groupby("UBahn", as_index=False)["Betrag"].sum()
          .sort_values(by="Betrag", ascending=False, ignore_index=True)
       )
print(out)

#### Task 7: Handling German U-Bahnlinie, too

In [None]:
verkehr = df[df['Politikbereich'] == 'Verkehr']

data = verkehr[['Zweck', 'Betrag']].copy()

# Rows that mention no U-Bahn line keep the missing value
data['U-Bahn'] = None
ubahn_col = data.columns.get_loc('U-Bahn')
for k in range(len(data)):                 # all rows, instead of a hard-coded row count
    zweck = str(data['Zweck'].iloc[k])
    for i in range(9, 0, -1):              # descending, so the lowest line number mentioned wins
        if any(label in zweck for label in (f"U {i}", f"U{i}", f"U-Bahnlinie {i}")):
            # one positional assignment instead of chained indexing
            data.iloc[k, ubahn_col] = i

bahn = data[~data['U-Bahn'].isnull()]
grouped = bahn.groupby(bahn['U-Bahn']).agg(['sum'])
g_sort = grouped.sort_values(by=[('Betrag', 'sum')], ascending=False)

g_sort[('Betrag', 'sum')]

#### Task 7: Sorting with lambda-keys

In [None]:
df4 = df[["Politikbereich", "Zweck", "Betrag"]]
df4 = df4.loc[df4["Politikbereich"] == "Verkehr"]
u_bahn = {}
for i in range(1, 10):
    u_bahn.update({"U" + str(i) : df4.loc[df4["Zweck"].str.contains("u" + str(i), case=False)]["Betrag"].sum()})

for k, v in sorted(u_bahn.items(), key=lambda x: x[1], reverse=True): # every item in u_bahn is a tuple (label, cost)
                                                                      # the lambda key selects the cost to sort by
    print(k)

#### Task 7: Someone took the time to think if the results make sense :^)

In [None]:
# The code below sums up the expenditures per U-Bahn line. It takes every line mentioned in a row into account, not just the first one
money_usage = {}
traffic = data_berlin[data_berlin["Politikbereich"].str.match("Verkehr")] # keep only the traffic data of Berlin

for i in range(1,10): # loop over all U-Bahn lines U1-U9
    label = "U" + str(i)
    # the lines appear mostly as "U 1" or "U1"; one pattern with an optional space covers both,
    # and a row that contains both spellings is counted only once
    mask = traffic["Zweck"].str.contains(rf"U ?{i}", na=False)
    money_usage.update({label: traffic[mask]["Betrag"].sum()}) # connecting the money to the U-Bahn line

sorted_money = pd.DataFrame.from_dict(money_usage, orient='index', columns=['Betrag']) # making a pd-DF out of it for nice appearing and easy sorting 
sorted_money.sort_values(by=["Betrag"], ascending = False)

The order shown above makes more or less sense. The cheapest line is the U4, but with just 5 stations it is also the shortest, and it is not heavily used. The most expensive is the U5. This also makes sense, as the section between Alexanderplatz and Hbf was only recently finished. The data was taken from the years before they did the construction. Freezing the soil below the Spree made the whole project a lot more expensive than planned, so the huge expenditure makes sense. I am a bit surprised that the U7 was so cheap, as it is the longest line; I would have guessed that it would be among the more expensive ones. It also surprises me that the lines U1 and U3 differ so much even though they share tracks for quite a few kilometers. But maybe the expenditure was booked under just one line even when it was for both. The other lines are difficult to evaluate, as I lack the expert knowledge.