# Question 1

```
What kind of analysis would you do to evaluate the quality of a segmentation methodology?
```

To evaluate the quality of a segmentation methodology, we need to observe it from the perspective of the end goal.

Here, the goal is to determine whether a given trend (e.g., wearing lycra clothing) is concentrated in one of our segments.

There are multiple possibilities:
- Chi-squared test:
  - Purpose: Detects whether the distribution of segments among the users of a trend differs significantly from their distribution in the full dataset. We could use it to identify trends where one of our segments is over-represented, suggesting a stronger link.
  - Pros: Computationally efficient.
  - Cons: Harder to interpret with groups of vastly varying sizes.
- Jaccard similarity:
  - Purpose: Measures overlap between sets (e.g., trends in different segments)
  - Pros: Intuitive interpretation; focuses on shared elements rather than just counts
  - Cons: Less effective when one set is much larger than the other, as is the case between a trend and a segment.
- Lift Analysis:
  - Purpose: Measures how much more likely a particular trend is to appear in one segment than in the overall population
  - Pros: Directly quantifies segment relevance to specific trends; intuitive business interpretation
  - Cons: Can be misleading with small sample sizes; doesn't account for temporal patterns
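
As a rough illustration of the first two options, the sketch below computes a Pearson chi-squared goodness-of-fit statistic and a Jaccard similarity in plain Python. The segment counts and user ids are made up for the example, and `chi_squared_statistic` and `jaccard` are hypothetical helper names, not part of the project code.

```python
def chi_squared_statistic(trend_counts, population_shares):
    """Pearson goodness-of-fit statistic of trend counts against population shares."""
    total = sum(trend_counts.values())
    stat = 0.0
    for segment, share in population_shares.items():
        expected = total * share
        observed = trend_counts.get(segment, 0)
        stat += (observed - expected) ** 2 / expected
    return stat


def jaccard(a, b):
    """Size of the intersection divided by the size of the union of two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up example: 100 trend users, population split 80/10/10.
population_shares = {"MAINSTREAM": 0.8, "EDGY": 0.1, "TRENDY": 0.1}
trend_counts = {"MAINSTREAM": 60, "EDGY": 30, "TRENDY": 10}

stat = chi_squared_statistic(trend_counts, population_shares)
# With 3 segments there are 2 degrees of freedom; the 5% critical value is 5.991.
print(f"chi2 = {stat:.2f}, significant at 5%: {stat > 5.991}")

# Made-up user ids: a small trend set against a larger segment set.
trend_users = {1, 2, 3, 4}
segment_users = {3, 4, 5, 6, 7, 8}
print(f"jaccard = {jaccard(trend_users, segment_users):.2f}")
```

The Jaccard value stays low even though half of the trend users belong to the segment, which is the size-imbalance limitation listed above.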


I will select Lift Analysis to evaluate the segmentation methodology, as its cons are not relevant here.
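
To make the measure concrete, here is a minimal pandas sketch of lift, assuming it is defined as the share of a segment among the users of a trend divided by the share of that segment in the whole population. The function name `segment_lift` and the toy data are assumptions for illustration; this is not the project's own evaluation code, whose implementation is not shown in this notebook.

```python
import pandas as pd


def segment_lift(population: pd.Series, trend_users: pd.Series) -> pd.DataFrame:
    """Lift of each segment for one trend.

    Both inputs hold one segment label per user: `population` for all users,
    `trend_users` for the users who follow the trend.
    """
    base_share = population.value_counts(normalize=True)
    # Segments absent from the trend get a share (and a lift) of 0.
    trend_share = (
        trend_users.value_counts(normalize=True)
        .reindex(base_share.index, fill_value=0.0)
    )
    return pd.DataFrame(
        {"share_in_trend": trend_share, "lift": trend_share / base_share}
    )


# Toy data: 100 users overall, 10 of them in the trend.
population = pd.Series(["MAINSTREAM"] * 80 + ["EDGY"] * 10 + ["TRENDY"] * 10)
trend_users = pd.Series(["MAINSTREAM"] * 5 + ["EDGY"] * 4 + ["TRENDY"] * 1)

result = segment_lift(population, trend_users)
print(result)
```

A lift of 1 means the segment is represented in the trend exactly as in the population; values above 1 indicate over-representation.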


```
What biases can you identify in the panel creation or in the accounts selection that might distort insights and how would you correct them?
```

Currently, the panel creation is based solely on the number of followers.
While this is an easy measure to interpret and compute, it comes with limitations:
- The raw number of followers does not distinguish accounts with low engagement from accounts with high engagement.
- Differentiating between trends like "Sportswear" or "Luxury" would be difficult as both subjects can appear independently of the number of followers.

Concerning the account selection, I can see that my sample is 100% based in the US, which might distort insights.
I will correct this by creating new panels based on additional metrics.
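
One candidate for such an additional metric is an engagement rate. The sketch below assumes a simple definition (mean likes plus mean comments per post, divided by the number of followers); this formula is an illustrative assumption, not a metric used elsewhere in this notebook. The inputs mirror the `mean_likes`, `mean_comments` and `NB_FOLLOWERS` columns of the author table shown later, and the two example accounts reuse the values of its first two rows.

```python
def engagement_rate(mean_likes, mean_comments, nb_followers):
    """Average interactions per post divided by audience size.

    Returns None when an input is missing (represented here as None)
    or when the account has no followers.
    """
    if mean_likes is None or mean_comments is None or not nb_followers:
        return None
    return (mean_likes + mean_comments) / nb_followers


# Two accounts from the same EDGY follower-based segment.
large_account = engagement_rate(39.609091, 43.622727, 600340.0)
smaller_account = engagement_rate(1572.0, 53.363636, 145477.0)

print(f"large account:   {large_account:.6f}")
print(f"smaller account: {smaller_account:.6f}")
```

Both accounts fall in the same follower-based segment, yet their engagement differs by well over an order of magnitude, which is the kind of distinction a follower count alone cannot capture.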

```
Feel free to use the data at your disposal to support your answer.
```

To correctly evaluate a segmentation methodology, I need a trend to compare it against.

To keep it simple, I will look at how worn patterns differ between summer and winter 2023.
Later on, it would be interesting to find more complex trends using itemset mining.
Here I define a trend as a pattern which has at least a 30% difference in its mean appearance between summer and winter.

I will then use lift analysis to compare how much of a trend can be explained by one of the base segmentation groups.
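
The trend definition above can be sketched as follows. Two points are assumptions, since the text does not pin them down: summer is taken as June to August and winter as December to February, and the 30% difference is measured relative to the stronger of the two seasons. The function name and the monthly shares are made up for illustration.

```python
from statistics import mean

# Assumed season boundaries; the notebook does not state which months it uses.
SUMMER_MONTHS = (6, 7, 8)
WINTER_MONTHS = (12, 1, 2)


def seasonal_trend(monthly_share, threshold=0.30):
    """Label a pattern "summer", "winter" or None.

    `monthly_share` maps a month number (1-12) to the mean share of posts
    showing the pattern in that month.
    """
    summer = mean(monthly_share[m] for m in SUMMER_MONTHS)
    winter = mean(monthly_share[m] for m in WINTER_MONTHS)
    low, high = min(summer, winter), max(summer, winter)
    if high == 0:
        return None  # the pattern never appears in either season
    # Relative gap measured against the stronger season (an assumption).
    if (high - low) / high < threshold:
        return None
    return "summer" if summer > winter else "winter"


# Made-up monthly shares for three patterns.
linen = {6: 0.040, 7: 0.050, 8: 0.045, 12: 0.010, 1: 0.012, 2: 0.011}
leather = {6: 0.05, 7: 0.05, 8: 0.05, 12: 0.10, 1: 0.10, 2: 0.10}
denim = {6: 0.20, 7: 0.20, 8: 0.20, 12: 0.19, 1: 0.19, 2: 0.19}

for name, shares in [("linen", linen), ("leather", leather), ("denim", denim)]:
    print(name, seasonal_trend(shares))
```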

In [1]:
from src.data.trend_single_pattern import detect_trend_single_pattern

detect_trend_single_pattern()

Loading data...
Data loaded


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  pattern_df.loc[:, "month_year"] = pattern_df["POST_PUBLICATION_DATE"].dt.to_period("M")


Saved 901 users for winter pattern 'vinyl' to data/1_interim/simple_trends/pattern/23W_vinyl.parquet
Saved 209 users for winter pattern 'tweed' to data/1_interim/simple_trends/pattern/23W_tweed.parquet
Saved 652 users for winter pattern 'suede' to data/1_interim/simple_trends/pattern/23W_suede.parquet
Saved 3315 users for winter pattern 'velvet' to data/1_interim/simple_trends/pattern/23W_velvet.parquet
Saved 269 users for winter pattern 'mohair' to data/1_interim/simple_trends/pattern/23W_mohair.parquet
Saved 4205 users for winter pattern 'fur' to data/1_interim/simple_trends/pattern/23W_fur.parquet
Saved 5122 users for winter pattern 'leather' to data/1_interim/simple_trends/pattern/23W_leather.parquet
Saved 2379 users for winter pattern 'bigpadded' to data/1_interim/simple_trends/pattern/23W_bigpadded.parquet
Saved 1361 users for winter pattern 'linen' to data/1_interim/simple_trends/pattern/23S_linen.parquet
Saved 1827 users for winter pattern 'tiedye' to data/1_interim/simple_tren

In [2]:
from src.data.evaluate import evaluate_trends

evaluate_trends("data/1_interim/simple_trends/pattern")

Unnamed: 0,trend_name,BASELINE_SEGMENTATION=EDGY,BASELINE_SEGMENTATION=EDGY_lift,BASELINE_SEGMENTATION=MAINSTREAM,BASELINE_SEGMENTATION=MAINSTREAM_lift,BASELINE_SEGMENTATION=TRENDY,BASELINE_SEGMENTATION=TRENDY_lift,Total,max_lift
3,23S_taffeta,0.244648,4.723366,0.620795,0.701848,0.134557,2.112699,337,4.723366
4,23W_tweed,0.220588,4.258844,0.651961,0.737083,0.127451,2.001133,209,4.258844
8,23W_mohair,0.186312,3.597077,0.688213,0.778068,0.125475,1.970112,269,3.597077
10,23W_vinyl,0.14518,2.802956,0.75029,0.84825,0.10453,1.64124,901,2.802956
12,23W_suede,0.142857,2.758108,0.759812,0.859015,0.097331,1.528217,652,2.758108
0,23S_linen,0.118886,2.295311,0.791573,0.894923,0.089541,1.405901,1361,2.295311
9,23W_velvet,0.099369,1.918495,0.81041,0.916219,0.090221,1.416575,3315,1.918495
2,23W_leather,0.089521,1.728369,0.819513,0.926511,0.090965,1.428265,5122,1.728369
5,23W_bigpadded,0.089005,1.718403,0.828098,0.936216,0.082897,1.301583,2379,1.718403
11,23W_fur,0.087097,1.681556,0.828536,0.936712,0.084367,1.324667,4205,1.681556


# Question 2

```
What do you think about the current segmentation? 
```

The baseline segmentation is much more effective than I anticipated.

Of the 13 trends, 6 have a maximum lift above 2, which indicates that the segmentation is relevant, and 3 of those are above 3, showing a strong over-representation of the trend in one segment.


```
What advantages / drawbacks do you see in the methodology / in the way the information is provided? 
```

Determining relevant trends, which is a prerequisite for evaluating segmentation quality, involves a lot of complexity.

This complexity stems from the format of the image_labels data, which carries many characteristics per image. Going beyond the popularity of a single element to detect trends is a project in itself.
With more time I would use itemset mining strategies to test the segmentation on more precisely defined trends.
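
To give an idea of what itemset mining would look like, here is a minimal Apriori-style sketch in plain Python, limited to itemsets of size one and two. The labels are made up, `frequent_itemsets` is a hypothetical helper, and a real experiment would more likely rely on a dedicated library and larger itemsets.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for frequent itemsets of size 1 and 2."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent = {
        frozenset([item]): count / n
        for item, count in item_counts.items()
        if count / n >= min_support
    }
    # Apriori pruning: a pair can only be frequent if both of its items are.
    kept = {next(iter(itemset)) for itemset in frequent}
    pair_counts = Counter()
    for t in transactions:
        candidates = sorted(set(t) & kept)
        pair_counts.update(combinations(candidates, 2))
    for pair, count in pair_counts.items():
        if count / n >= min_support:
            frequent[frozenset(pair)] = count / n
    return frequent


# Each post is represented by the set of labels detected in its image.
posts = [
    {"leather", "black", "boots"},
    {"leather", "black", "jacket"},
    {"leather", "black", "boots"},
    {"linen", "white", "dress"},
    {"linen", "white"},
]

result = frequent_itemsets(posts, min_support=0.4)
for itemset, support in sorted(
    result.items(), key=lambda kv: (-kv[1], sorted(kv[0]))
):
    print(sorted(itemset), round(support, 2))
```

Frequent pairs such as leather with black would then become candidate trends to evaluate against the segments, instead of single labels.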

```
You can play around with the provided data to support your statements.
```

In [3]:
import pandas as pd

df_authors_with_segmentation = pd.read_parquet(
    "data/1_interim/extended_data/merged_authors_extended.parquet"
)
df_authors_with_segmentation

Unnamed: 0,AUTHORID,BIO,NB_FOLLOWS,GEO_ZONE,FASHION_INTEREST_SEGMENT,NB_FOLLOWERS,BASELINE_SEGMENTATION,total_posts,mean_likes,median_likes,max_likes,mean_comments,median_comments,max_comments,total_collabs,PROP_COLLAB_POSTS,PROP_SPORT_POSTS
0,10371901,,1089.0,us,True,600340.0,EDGY,220.0,39.609091,0.0,1851.0,43.622727,35.0,223.0,0.0,0.0,0.643192
1,6672672415,,23.0,us,True,145477.0,EDGY,11.0,1572.000000,1446.0,4652.0,53.363636,30.0,197.0,0.0,0.0,0.555556
2,47810645367,💼 Meme Dealer | Philosopher | Poet | Explorer\...,122.0,us,False,20.0,MAINSTREAM,,,,,,,,,0.0,0.000000
3,43850911627,,945.0,us,False,1004.0,MAINSTREAM,13.0,34.076923,0.0,236.0,8.076923,9.0,14.0,0.0,0.0,0.923077
4,1524262985,"I'm a MN hemp farmer, specializing in holistic...",395.0,us,False,290.0,MAINSTREAM,2.0,75.500000,75.5,151.0,3.500000,3.5,7.0,0.0,0.0,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22434,48581377338,,,us,False,224.0,MAINSTREAM,1.0,3.000000,3.0,3.0,1.000000,1.0,1.0,0.0,0.0,1.000000
22435,4788566,,,us,False,62918.0,EDGY,,,,,,,,,0.0,0.000000
22436,14409954752,| authentic. cozy. real. raw. |\nNow booking 2...,403.0,us,False,629.0,MAINSTREAM,100.0,94.410000,86.5,293.0,8.870000,3.0,468.0,0.0,0.0,0.731959
22437,3919424416,Introducing Jacksonville to Jacksonville. High...,2977.0,us,False,14835.0,TRENDY,3.0,1543.666667,622.0,3630.0,102.333333,33.0,255.0,0.0,0.0,0.333333



Looking at derivative information from the post captions and the non-clothing detected objects, I will test 2 new segmentations:

- A sport segmentation, based on the proportion of sports items (ski, surfboard, ...) found in the posts of a given user.
- A segmentation based on the proportion of user posts which indicate a paid sponsorship.

In [4]:
# Flag accounts where at least 95% of posts indicate a paid sponsorship.
collab_mask = df_authors_with_segmentation["PROP_COLLAB_POSTS"] >= 0.95
df_authors_with_segmentation.loc[collab_mask, "IS_COLLAB"] = 1
df_authors_with_segmentation.loc[~collab_mask, "IS_COLLAB"] = 0

# Bucket accounts by their proportion of sport posts.
sport_mask_none = df_authors_with_segmentation["PROP_SPORT_POSTS"] == 0
sport_mask_low = df_authors_with_segmentation["PROP_SPORT_POSTS"] <= 1 / 3
sport_mask_high = df_authors_with_segmentation["PROP_SPORT_POSTS"] >= 2 / 3
df_authors_with_segmentation.loc[sport_mask_low, "IS_SPORT"] = "LOW"
df_authors_with_segmentation.loc[sport_mask_high, "IS_SPORT"] = "HIGH"
df_authors_with_segmentation.loc[~sport_mask_high & ~sport_mask_low, "IS_SPORT"] = "MEDIUM"
# NONE is assigned last so that it overrides LOW for accounts with no sport posts.
df_authors_with_segmentation.loc[sport_mask_none, "IS_SPORT"] = "NONE"


df_authors_with_segmentation

Unnamed: 0,AUTHORID,BIO,NB_FOLLOWS,GEO_ZONE,FASHION_INTEREST_SEGMENT,NB_FOLLOWERS,BASELINE_SEGMENTATION,total_posts,mean_likes,median_likes,max_likes,mean_comments,median_comments,max_comments,total_collabs,PROP_COLLAB_POSTS,PROP_SPORT_POSTS,IS_COLLAB,IS_SPORT
0,10371901,,1089.0,us,True,600340.0,EDGY,220.0,39.609091,0.0,1851.0,43.622727,35.0,223.0,0.0,0.0,0.643192,0.0,MEDIUM
1,6672672415,,23.0,us,True,145477.0,EDGY,11.0,1572.000000,1446.0,4652.0,53.363636,30.0,197.0,0.0,0.0,0.555556,0.0,MEDIUM
2,47810645367,💼 Meme Dealer | Philosopher | Poet | Explorer\...,122.0,us,False,20.0,MAINSTREAM,,,,,,,,,0.0,0.000000,0.0,NONE
3,43850911627,,945.0,us,False,1004.0,MAINSTREAM,13.0,34.076923,0.0,236.0,8.076923,9.0,14.0,0.0,0.0,0.923077,0.0,HIGH
4,1524262985,"I'm a MN hemp farmer, specializing in holistic...",395.0,us,False,290.0,MAINSTREAM,2.0,75.500000,75.5,151.0,3.500000,3.5,7.0,0.0,0.0,0.000000,0.0,NONE
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22434,48581377338,,,us,False,224.0,MAINSTREAM,1.0,3.000000,3.0,3.0,1.000000,1.0,1.0,0.0,0.0,1.000000,0.0,HIGH
22435,4788566,,,us,False,62918.0,EDGY,,,,,,,,,0.0,0.000000,0.0,NONE
22436,14409954752,| authentic. cozy. real. raw. |\nNow booking 2...,403.0,us,False,629.0,MAINSTREAM,100.0,94.410000,86.5,293.0,8.870000,3.0,468.0,0.0,0.0,0.731959,0.0,HIGH
22437,3919424416,Introducing Jacksonville to Jacksonville. High...,2977.0,us,False,14835.0,TRENDY,3.0,1543.666667,622.0,3630.0,102.333333,33.0,255.0,0.0,0.0,0.333333,0.0,LOW


In [5]:
evaluate_trends(
    "data/1_interim/simple_trends/pattern",
    df_authors_with_segmentation,
    segment_column="IS_COLLAB",
)

Unnamed: 0,trend_name,IS_COLLAB=0.0,IS_COLLAB=0.0_lift,IS_COLLAB=1.0,IS_COLLAB=1.0_lift,Total,max_lift
12,23W_suede,0.99843,0.999544,0.00157,1.409042,652,1.409042
1,23S_tiedye,1.0,1.001115,0.0,0.0,1827,1.001115
3,23S_taffeta,1.0,1.001115,0.0,0.0,337,1.001115
4,23W_tweed,1.0,1.001115,0.0,0.0,209,1.001115
0,23S_linen,1.0,1.001115,0.0,0.0,1361,1.001115
10,23W_vinyl,1.0,1.001115,0.0,0.0,901,1.001115
5,23W_bigpadded,1.0,1.001115,0.0,0.0,2379,1.001115
8,23W_mohair,1.0,1.001115,0.0,0.0,269,1.001115
7,23S_flowers,1.0,1.001115,0.0,0.0,2073,1.001115
11,23W_fur,1.0,1.001115,0.0,0.0,4205,1.001115


In [6]:
evaluate_trends(
    "data/1_interim/simple_trends/pattern",
    df_authors_with_segmentation,
    segment_column="IS_SPORT",
)

Unnamed: 0,trend_name,IS_SPORT=MEDIUM,IS_SPORT=MEDIUM_lift,IS_SPORT=NONE,IS_SPORT=NONE_lift,IS_SPORT=HIGH,IS_SPORT=HIGH_lift,IS_SPORT=LOW,IS_SPORT=LOW_lift,Total,max_lift
4,23W_tweed,0.52451,1.48081,0.0,0.0,0.357843,1.796742,0.117647,0.493528,209,1.796742
10,23W_vinyl,0.589327,1.663804,0.0,0.0,0.219258,1.1009,0.191415,0.802985,901,1.663804
7,23S_flowers,0.589143,1.663285,0.0,0.0,0.239542,1.202748,0.171315,0.718664,2073,1.663285
3,23S_taffeta,0.58104,1.640407,0.0,0.0,0.330275,1.658323,0.088685,0.372033,337,1.658323
1,23S_tiedye,0.586994,1.657216,0.0,0.0,0.171135,0.859276,0.241871,1.014647,1827,1.657216
8,23W_mohair,0.574144,1.62094,0.0,0.0,0.1673,0.840021,0.258555,1.084636,269,1.62094
9,23W_velvet,0.572375,1.615943,0.0,0.0,0.231788,1.163816,0.195837,0.821535,3315,1.615943
0,23S_linen,0.571106,1.612362,0.0,0.0,0.285177,1.431882,0.143717,0.602892,1361,1.612362
2,23W_leather,0.565388,1.596217,0.0,0.0,0.23948,1.202438,0.195132,0.818577,5122,1.596217
12,23W_suede,0.56044,1.582248,0.0,0.0,0.276295,1.387287,0.163265,0.684896,652,1.582248


Both new segmentations are less descriptive than the baseline: their maximum lifts stay below 2 on the pattern trends, whereas the baseline reaches lifts above 4.
On reflection, pattern trends may not be the most relevant benchmark for these two segmentations:
- Sponsorship might show a better segmentation quality when studying brand trends.
- Similarly, some shoe brands differ in popularity from one sport to another, which makes them an interesting subject to study.

In [7]:
from src.data.trend_single_type import detect_trend_single_type

detect_trend_single_type("shoe_brand")

Loading data...
Data loaded
Found 21 unique shoe_brand
Saved 8187 users for 'on' to data/1_interim/simple_trends/shoe_brand/on.parquet
Saved 5917 users for 'nike' to data/1_interim/simple_trends/shoe_brand/nike.parquet
Saved 484 users for 'louboutin' to data/1_interim/simple_trends/shoe_brand/louboutin.parquet
Saved 3495 users for 'adidas' to data/1_interim/simple_trends/shoe_brand/adidas.parquet
Saved 233 users for 'bottegaveneta' to data/1_interim/simple_trends/shoe_brand/bottegaveneta.parquet
Saved 329 users for 'louisvuitton' to data/1_interim/simple_trends/shoe_brand/louisvuitton.parquet
Saved 222 users for 'hermes' to data/1_interim/simple_trends/shoe_brand/hermes.parquet
Saved 472 users for 'gucci' to data/1_interim/simple_trends/shoe_brand/gucci.parquet
Saved 722 users for 'dior' to data/1_interim/simple_trends/shoe_brand/dior.parquet
Saved 811 users for 'new_balance' to data/1_interim/simple_trends/shoe_brand/new_balance.parquet
Saved 255 users for 'prada' to data/1_interim/si

In [8]:
evaluate_trends(
    "data/1_interim/simple_trends/shoe_brand",
    df_authors_with_segmentation,
    segment_column="IS_COLLAB",
)

Unnamed: 0,trend_name,IS_COLLAB=0.0,IS_COLLAB=0.0_lift,IS_COLLAB=1.0,IS_COLLAB=1.0_lift,Total,max_lift
0,balencia,1.0,1.001115,0.0,0.0,888,1.001115
1,jacquemus,1.0,1.001115,0.0,0.0,1,1.001115
3,miu_miu,1.0,1.001115,0.0,0.0,309,1.001115
4,hermes,1.0,1.001115,0.0,0.0,222,1.001115
5,louboutin,1.0,1.001115,0.0,0.0,484,1.001115
18,new_balance,1.0,1.001115,0.0,0.0,811,1.001115
7,ysl,1.0,1.001115,0.0,0.0,43,1.001115
8,jimmychoo,1.0,1.001115,0.0,0.0,2,1.001115
9,dior,1.0,1.001115,0.0,0.0,722,1.001115
11,chloe,1.0,1.001115,0.0,0.0,78,1.001115


In [9]:
evaluate_trends(
    "data/1_interim/simple_trends/shoe_brand",
    df_authors_with_segmentation,
    segment_column="IS_SPORT",
)

Unnamed: 0,trend_name,IS_SPORT=MEDIUM,IS_SPORT=MEDIUM_lift,IS_SPORT=NONE,IS_SPORT=NONE_lift,IS_SPORT=HIGH,IS_SPORT=HIGH_lift,IS_SPORT=LOW,IS_SPORT=LOW_lift,Total,max_lift
1,jacquemus,0.0,0.0,0.0,0.0,1.0,5.021034,0.0,0.0,1,5.021034
8,jimmychoo,0.5,1.411613,0.0,0.0,0.5,2.510517,0.0,0.0,2,2.510517
11,chloe,0.434211,1.225874,0.0,0.0,0.407895,2.048053,0.157895,0.662367,78,2.048053
13,bottegaveneta,0.537118,1.516405,0.0,0.0,0.406114,2.03911,0.056769,0.238144,233,2.03911
4,hermes,0.516129,1.457149,0.0,0.0,0.396313,1.989903,0.087558,0.367303,222,1.989903
7,ysl,0.536585,1.514902,0.0,0.0,0.390244,1.959428,0.073171,0.30695,43,1.959428
20,chanel,0.540698,1.526512,0.0,0.0,0.377907,1.897484,0.081395,0.341453,176,1.897484
16,givenchy,0.666667,1.882151,0.0,0.0,0.0,0.0,0.333333,1.39833,3,1.882151
10,veja,0.535088,1.510674,0.0,0.0,0.368421,1.849855,0.096491,0.40478,115,1.849855
14,alexandermcqueen,0.588652,1.661899,0.0,0.0,0.35461,1.780508,0.056738,0.238014,143,1.780508


In [10]:
evaluate_trends(
    "data/1_interim/simple_trends/shoe_brand",
    df_authors_with_segmentation,
    segment_column="BASELINE_SEGMENTATION",
)

Unnamed: 0,trend_name,BASELINE_SEGMENTATION=EDGY,BASELINE_SEGMENTATION=EDGY_lift,BASELINE_SEGMENTATION=MAINSTREAM,BASELINE_SEGMENTATION=MAINSTREAM_lift,BASELINE_SEGMENTATION=TRENDY,BASELINE_SEGMENTATION=TRENDY_lift,Total,max_lift
8,jimmychoo,0.0,0.0,0.0,0.0,1.0,15.701198,2,15.701198
20,chanel,0.348837,6.734916,0.459302,0.51927,0.19186,3.012439,176,6.734916
15,prada,0.268293,5.179862,0.585366,0.661793,0.146341,2.297736,255,5.179862
13,bottegaveneta,0.248908,4.805613,0.611354,0.691174,0.139738,2.194054,233,4.805613
7,ysl,0.243902,4.708966,0.585366,0.661793,0.170732,2.680692,43,4.708966
10,veja,0.210526,4.064581,0.701754,0.793377,0.087719,1.377298,115,4.064581
4,hermes,0.198157,3.825763,0.668203,0.755445,0.133641,2.098317,222,3.825763
3,miu_miu,0.197368,3.810545,0.671053,0.758667,0.131579,2.065947,309,3.810545
5,louboutin,0.186147,3.593899,0.686147,0.775732,0.127706,2.005131,484,3.593899
14,alexandermcqueen,0.177305,3.423184,0.751773,0.849927,0.070922,1.11356,143,3.423184


The baseline segmentation still gives great results, with a much stronger performance than I would have expected.

The collab segmentation does not deliver results: with the 0.95 threshold, almost no account falls in the IS_COLLAB=1 group, so the trends shown contain no user from it. It is a less promising direction than I anticipated.

The sport segmentation delivers some useful insights, as it provides better quality for the trends linked to brands like Jacquemus or Chloé (although the Jacquemus trend contains a single user), but it explains fewer trends than the baseline.
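
Since lift can be misleading on small samples, one way to compare the two segmentations more fairly is to ignore trends with too few users. The sketch below reuses the `Total` and `max_lift` values of the ten rows displayed above for the `IS_SPORT` and `BASELINE_SEGMENTATION` shoe_brand tables. The minimum of 30 users is an arbitrary assumption, the lift threshold of 2 echoes the one used in Question 2, and rows not displayed in the outputs are not covered.

```python
# (trend, Total, max_lift) copied from the ten rows shown for each table.
sport_rows = [
    ("jacquemus", 1, 5.021034), ("jimmychoo", 2, 2.510517),
    ("chloe", 78, 2.048053), ("bottegaveneta", 233, 2.03911),
    ("hermes", 222, 1.989903), ("ysl", 43, 1.959428),
    ("chanel", 176, 1.897484), ("givenchy", 3, 1.882151),
    ("veja", 115, 1.849855), ("alexandermcqueen", 143, 1.780508),
]
baseline_rows = [
    ("jimmychoo", 2, 15.701198), ("chanel", 176, 6.734916),
    ("prada", 255, 5.179862), ("bottegaveneta", 233, 4.805613),
    ("ysl", 43, 4.708966), ("veja", 115, 4.064581),
    ("hermes", 222, 3.825763), ("miu_miu", 309, 3.810545),
    ("louboutin", 484, 3.593899), ("alexandermcqueen", 143, 3.423184),
]


def well_supported(rows, min_users=30, min_lift=2.0):
    """Trends with enough users and a max lift at or above the threshold."""
    return [
        name for name, total, lift in rows
        if total >= min_users and lift >= min_lift
    ]


print("sport:   ", well_supported(sport_rows))
print("baseline:", well_supported(baseline_rows))
```

With this filter, the very high lifts obtained on one or two users (Jacquemus, Jimmy Choo) drop out, and the baseline still explains more of the displayed trends than the sport segmentation.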

# Question 3

```
We would like to push our segmentation one step further. The current methodology refrains us from spotting interesting (sub)categories of accounts like fashion enthusiasts, luxury, sportswear etc. Several families of methods can be considered to improve this methodology:
- Statistical approach
- Machine Learning approaches (supervised and/or unsupervised)
- Hybrid approach combining deterministic rules and machine learning
Which approach would you prefer and why? 
```

I think a statistical approach would be a good baseline method to deploy, as it requires little R&D and can be put into production quickly.

As statistical approaches will be blind to complex associations, a second approach using Machine Learning would be interesting.

I would use domain experts' intuition to focus the ML experimentation on specific segmentation groups known to be of interest to Heuritech clients. This should allow quicker development than starting with an attempt to build a holistic method.


From my understanding, the problem formulation has 3 areas:
- Determining trends to evaluate how relevant segmentation groups are
  - An interesting direction would be to detect macro trends (90s clothing, casual, ...) by doing itemset mining on the different trends picked by Instagram users. The Apriori algorithm is generally a go-to solution, but with the large amount of data handled by Heuritech, other approaches might be more suitable, such as LCM (Linear time Closed itemset Miner) or approximate sketching techniques like MinHash and Count-Min Sketch.
  - Another interesting methodology would be to do unsupervised clustering on one-hot encoded vectors of all the present labels for a given image. Previously this method could be quite costly on the compute side, but with commoditization of vector databases and their index creation, there has been a lot of efficiency gained, and it would be a one-time computation.
- Proposing segmentation groups
  - I think there is a lot of value in exploring the statistics of already predicted information. For example, the luxury panel could be created by selecting users having posted images of high-end brands. 
  - Finding segmentation groups could be seen as a similar problem to trend detection, as multiple trends might show fashion styles representative of some groups of people.
  - Another possibility would be to create embeddings of the post captions, and cluster them into predefined groups by similarity (for example using strings like "luxury" or "football"). There is a cost involved in the first computation of embeddings, but by storing them (with the appropriate index) it is then possible to query them on as many sentence similarities as desired with only a marginal cost added. 
  - At the other end of the R&D spectrum, more advanced techniques that I am starting to read about, like Graph Neural Networks, could be promising, especially with the current database structure. Recent work such as https://github.com/snap-stanford/relbench could offer highly descriptive embeddings for user and clothing nodes, and even edge embeddings to represent trends.
- Evaluating the quality of the segmentation groups
  - The other statistical methods (chi-squared test, Jaccard similarity) could be tested to choose the one that best represents the end business case, which may differ from project to project.
  - The current methods mostly focus on how much a current trend is captured by a group, but asking the question from the other direction (for a given new trend, how much of our segmentation group would follow it) might be interesting for clothing brands looking to optimize for user adoption.
  - Using ML approaches, like training Random Forests or logistic regressions, and then looking at the feature importance across multiple of our segmentation groups, could allow us to evaluate the quality of combined segmentation groups.
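
The reverse question raised in the evaluation bullets (for a given trend, how much of a segment follows it) amounts to estimating P(trend | segment) instead of lift. A minimal sketch, with made-up users and a hypothetical `adoption_rate` helper:

```python
from collections import Counter


def adoption_rate(segment_of_user, trend_users):
    """Share of each segment's users that appear in the trend: P(trend | segment)."""
    segment_sizes = Counter(segment_of_user.values())
    adopters = Counter(
        segment_of_user[user] for user in trend_users if user in segment_of_user
    )
    return {
        segment: adopters[segment] / size
        for segment, size in segment_sizes.items()
    }


# Made-up panel of 10 users; user 99 is in the trend but not in the panel.
segment_of_user = {
    1: "MAINSTREAM", 2: "MAINSTREAM", 3: "MAINSTREAM",
    4: "MAINSTREAM", 5: "MAINSTREAM", 6: "MAINSTREAM",
    7: "EDGY", 8: "EDGY",
    9: "TRENDY", 10: "TRENDY",
}
trend_users = {1, 7, 8, 99}

rates = adoption_rate(segment_of_user, trend_users)
for segment, rate in rates.items():
    print(segment, round(rate, 2))
```

Unlike lift, this value depends on how widespread the trend is overall, so it answers the adoption question for a brand rather than the question of how well a segment characterizes a trend.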


```
What would be the advantages/limits of each of these methods? 
```

I see a general tradeoff between specific strategies (often relying on domain experts' intuition) which are easier to develop, and more abstract methods which necessitate more R&D but have holistic potential.

```
You are free to exploit the author database the way you want. You can also generate new tables to develop your approach.
```

I spent a long time mining for more relevant trends so that I could better assess the segmentation methodology. Unfortunately, I did not have time to finish these experiments, so I cannot present them in a clean way here.